Preface



This volume contains papers presented at the Fourth European Conference on
Computational Learning Theory, which was held at Nordkirchen Castle, in
Nordkirchen, NRW, Germany, from March 29 to 31, 1999. This conference is the
fourth in a series of biennial conferences established in 1993.

The EuroCOLT conferences are focused on the analysis of learning algorithms 
and the theory of machine learning, and bring together researchers from a wide 
variety of related fields. Some of the issues and topics that are addressed include 
the sample and computational complexity of learning specific model classes, 
frameworks modeling the interaction between the learner, teacher and the
environment (such as learning with queries, learning control policies and inductive
inference), learning with complex models (such as decision trees, neural networks, 
and support vector machines), learning with minimal prior assumptions (such 
as mistake-bound models, universal prediction, and agnostic learning), and the 
study of model selection techniques. We hope that these conferences stimulate 
an interdisciplinary scientific interaction that will be fruitful in all represented 
fields. 

Thirty-five papers were submitted to the program committee for consideration,
and twenty-one of these were accepted for presentation at the conference
and publication in these proceedings. In addition, Robert Schapire (AT&T
Labs) and Richard Sutton (AT&T Labs) were invited to give lectures and
contribute a written version to these proceedings. There were a number of other
joint events including a banquet and an excursion to Münster. The IFIP WG
1.4 Scholarship was awarded to Andras Antos for his paper “Lower bounds on
the rate of convergence of nonparametric pattern recognition”.

The EuroCOLT ’99 conference was sponsored by the DFG (Deutsche
Forschungsgemeinschaft) and by IFIP through WG 1.4.

We want to thank everybody who helped to make this meeting possible: the 
authors for submitting papers, the Program Committee and referees for their 
effort in composing the program, the Steering Committee, the sponsors, the 
local organizers, and Springer-Verlag. Thanks to Norbert Klasner and Karsten
Tinnefeld for their help in preparing the proceedings, and thanks to Margret 
Vaupel for her administrative support. The program committee wishes to thank 
everybody who acted as subreferee.



Theoretical Views of Boosting



Robert E. Schapire 

AT&T Labs, Shannon Laboratory 
180 Park Avenue, Room A279, Florham Park, NJ 07932, USA 



Abstract. Boosting is a general method for improving the accuracy of
any given learning algorithm. Focusing primarily on the AdaBoost algorithm,
we briefly survey theoretical work on boosting including analyses
of AdaBoost’s training error and generalization error, connections between
boosting and game theory, methods of estimating probabilities
using boosting, and extensions of AdaBoost for multiclass classification
problems. We also briefly mention some empirical work.



Background 

Boosting is a general method which attempts to “boost” the accuracy of any
given learning algorithm. Kearns and Valiant [21, 22] were the first to pose the
question of whether a “weak” learning algorithm which performs just slightly
better than random guessing in Valiant’s PAC model can be “boosted”
into an arbitrarily accurate “strong” learning algorithm. Schapire came up
with the first provable polynomial-time boosting algorithm in 1989. A year later,
Freund developed a much more efficient boosting algorithm which, although
optimal in a certain sense, nevertheless suffered from certain practical drawbacks.
The first experiments with these early boosting algorithms were carried out by
Drucker, Schapire and Simard on an OCR task.



AdaBoost 

The AdaBoost algorithm, introduced in 1995 by Freund and Schapire, solved
many of the practical difficulties of the earlier boosting algorithms, and is the
focus of this paper. Pseudocode for AdaBoost is given in Fig. 1 in the slightly
generalized form given by Schapire and Singer. The algorithm takes as input
a training set $(x_1, y_1), \ldots, (x_m, y_m)$ where each $x_i$ belongs to some domain or
instance space $X$, and each label $y_i$ is in some label set $Y$. For most of this
paper, we assume $Y = \{-1, +1\}$; later, we discuss extensions to the multiclass
case. AdaBoost calls a given weak or base learning algorithm repeatedly in a
series of rounds $t = 1, \ldots, T$. One of the main ideas of the algorithm is to
maintain a distribution or set of weights over the training set. The weight of
this distribution on training example $i$ on round $t$ is denoted $D_t(i)$. Initially, all
weights are set equally, but on each round, the weights of incorrectly classified
examples are increased so that the weak learner is forced to focus on the hard
examples in the training set.



Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$
Initialize $D_1(i) = 1/m$.

For $t = 1, \ldots, T$:

- Train weak learner using distribution $D_t$.
- Get weak hypothesis $h_t : X \to \mathbb{R}$.
- Choose $\alpha_t \in \mathbb{R}$.
- Update:

$$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$

where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).

Output the final hypothesis:

$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right).$$

Fig. 1. The boosting algorithm AdaBoost.

The weak learner’s job is to find a weak hypothesis $h_t : X \to \mathbb{R}$ appropriate
for the distribution $D_t$. In the simplest case, the range of each $h_t$ is binary, i.e.,
restricted to $\{-1, +1\}$; the weak learner’s job then is to minimize the error

$$\epsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \neq y_i\right].$$

Once the weak hypothesis $h_t$ has been received, AdaBoost chooses a parameter
$\alpha_t \in \mathbb{R}$ which intuitively measures the importance that it assigns to $h_t$. In
the figure, we have deliberately left the choice of $\alpha_t$ unspecified. For binary $h_t$,
we typically set

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right). \qquad (1)$$

More on choosing $\alpha_t$ follows below. The distribution $D_t$ is then updated using
the rule shown in the figure. The final hypothesis $H$ is a weighted majority vote
of the $T$ weak hypotheses where $\alpha_t$ is the weight assigned to $h_t$.
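The procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in any of the cited experiments: the weak learner is assumed to be an exhaustive search over one-feature threshold tests (decision stumps), the weak hypotheses are binary so that the choice of $\alpha_t$ from Eq. (1) applies, and the names `stump_predict`, `train_stump`, `adaboost` and `predict` are hypothetical.

```python
import math

def stump_predict(x, feature, threshold, sign):
    """Decision stump: predict `sign` if x[feature] > threshold, else -sign."""
    return sign if x[feature] > threshold else -sign

def train_stump(X, y, D):
    """Assumed weak learner: pick the stump with the lowest D-weighted error."""
    best = None
    for feature in range(len(X[0])):
        values = sorted(set(row[feature] for row in X))
        # The extra threshold below the smallest value yields a constant hypothesis.
        for threshold in [values[0] - 1.0] + values:
            for sign in (+1, -1):
                err = sum(D[i] for i in range(len(X))
                          if stump_predict(X[i], feature, threshold, sign) != y[i])
                if best is None or err < best[0]:
                    best = (err, feature, threshold, sign)
    return best

def adaboost(X, y, T):
    """Run T rounds; return a list of (alpha, feature, threshold, sign)."""
    m = len(X)
    D = [1.0 / m] * m                            # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        err, feature, threshold, sign = train_stump(X, y, D)
        err = min(max(err, 1e-12), 1 - 1e-12)    # clamp to keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)  # Eq. (1)
        ensemble.append((alpha, feature, threshold, sign))
        # Misclassified examples gain weight, correctly classified ones lose it.
        D = [D[i] * math.exp(-alpha * y[i] *
                             stump_predict(X[i], feature, threshold, sign))
             for i in range(m)]
        Z = sum(D)                               # normalization factor Z_t
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    """Final hypothesis H(x) = sign(sum_t alpha_t h_t(x))."""
    f = sum(alpha * stump_predict(x, feature, threshold, sign)
            for alpha, feature, threshold, sign in ensemble)
    return 1 if f >= 0 else -1
```

On a one-dimensional training set whose labels follow a plus, minus, plus pattern, no single stump is correct everywhere, yet a weighted vote of three stumps can be; that is the kind of case in which combining weak hypotheses helps.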



Analyzing the training error 

The most basic theoretical property of AdaBoost concerns its ability to reduce
the training error. Specifically, Schapire and Singer, in generalizing a theorem
of Freund and Schapire, show that the training error of the final hypothesis
is bounded as follows:

$$\frac{1}{m}\left|\{i : H(x_i) \neq y_i\}\right| \le \frac{1}{m}\sum_i \exp(-y_i f(x_i)) = \prod_t Z_t \qquad (2)$$

where $f(x) = \sum_t \alpha_t h_t(x)$ so that $H(x) = \mathrm{sign}(f(x))$. The inequality follows
from the fact that $e^{-y_i f(x_i)} \ge 1$ if $y_i \neq H(x_i)$. The equality can be proved
straightforwardly by unraveling the recursive definition of $D_t$.

Eq. (2) suggests that the training error can be reduced most rapidly (in a
greedy way) by choosing $\alpha_t$ and $h_t$ on each round to minimize

$$Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)).$$

In the case of binary hypotheses, this leads to the choice of $\alpha_t$ given in Eq. (1)
and gives a bound on the training error of

$$\prod_t \left[2\sqrt{\epsilon_t(1 - \epsilon_t)}\right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2\sum_t \gamma_t^2\right)$$

where $\epsilon_t = 1/2 - \gamma_t$. This bound was first proved by Freund and Schapire.
Thus, if each weak hypothesis is slightly better than random so that $\gamma_t$ is bounded
away from zero, then the training error drops exponentially fast. This bound,
combined with the bounds on generalization error given below, proves that
AdaBoost is indeed a boosting algorithm in the sense that it can efficiently convert
a weak learning algorithm (which can always generate a hypothesis with a weak
edge for any distribution) into a strong learning algorithm (which can generate
a hypothesis with an arbitrarily low error rate, given sufficient data).
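As a small numerical illustration of the quantities in this bound, the following sketch evaluates $Z_t$ for a binary weak hypothesis with weighted error $\epsilon_t$, and compares the product form of the bound with its exponential relaxation. The function names are hypothetical, and the closed form used for $Z_t$ assumes $h_t(x_i) \in \{-1, +1\}$, so that correctly classified weight is scaled by $e^{-\alpha}$ and misclassified weight by $e^{\alpha}$.

```python
import math

def z_value(eps, alpha):
    """Z_t for a binary weak hypothesis with D_t-weighted error eps."""
    return (1 - eps) * math.exp(-alpha) + eps * math.exp(alpha)

def best_alpha(eps):
    """The choice of alpha_t from Eq. (1)."""
    return 0.5 * math.log((1 - eps) / eps)

def training_error_bounds(gammas):
    """Return (prod_t sqrt(1 - 4 gamma_t^2), exp(-2 sum_t gamma_t^2))."""
    product = 1.0
    for g in gammas:
        product *= math.sqrt(1 - 4 * g * g)
    return product, math.exp(-2 * sum(g * g for g in gammas))
```

For example, `training_error_bounds([0.1] * 50)` shows how quickly the bound shrinks when every round has an edge of 0.1 over random guessing.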

Eq. (2) points to the fact that, at heart, AdaBoost is a procedure for finding
a linear combination $f$ of weak hypotheses which attempts to minimize

$$\sum_i \exp(-y_i f(x_i)) = \sum_i \exp\left(-y_i \sum_t \alpha_t h_t(x_i)\right). \qquad (3)$$

Essentially, on each round, AdaBoost chooses $h_t$ (by calling the weak learner)
and then sets $\alpha_t$ to add one more term to the accumulating weighted sum of weak
hypotheses in such a way that the sum of exponentials above will be maximally
reduced. In other words, AdaBoost is doing a kind of steepest descent search to
minimize Eq. (3) where the search is constrained at each step to follow coordinate
directions (where we identify coordinates with the weights assigned to weak
hypotheses).

Schapire and Singer discuss the choice of $\alpha_t$ and $h_t$ in the case that $h_t$
is real-valued (rather than binary). In this case, $h_t(x)$ can be interpreted as a
“confidence-rated prediction” in which the sign of $h_t(x)$ is the predicted label,
while the magnitude $|h_t(x)|$ gives a measure of confidence.



Generalization error

Freund and Schapire showed how to bound the generalization error of the
final hypothesis in terms of its training error, the size $m$ of the sample, the
VC-dimension $d$ of the weak hypothesis space and the number of rounds $T$ of
boosting. Specifically, they used techniques from Baum and Haussler to show
that the generalization error, with high probability, is at most

$$\hat{\Pr}\left[H(x) \neq y\right] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$$

where $\hat{\Pr}[\cdot]$ denotes empirical probability on the training sample. This bound
suggests that boosting will overfit if run for too many rounds, i.e., as $T$ becomes
large. In fact, this sometimes does happen. However, in early experiments, several
authors observed empirically that boosting often does not overfit, even
when run for thousands of rounds. Moreover, it was observed that AdaBoost
would sometimes continue to drive down the generalization error long after the
training error had reached zero, clearly contradicting the spirit of the bound
above. For instance, the left side of Fig. 2 shows the training and test curves of
running boosting on top of Quinlan’s C4.5 decision-tree learning algorithm
on the “letter” dataset.

In response to these empirical findings, Schapire et al., following the
work of Bartlett, gave an alternative analysis in terms of the margins of the
training examples. The margin of example $(x, y)$ is defined to be

$$\mathrm{margin}_f(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t |\alpha_t|}.$$

It is a number in $[-1, +1]$ which is positive if and only if $H$ correctly classifies
the example. Moreover, as before, the magnitude of the margin can be interpreted
as a measure of confidence in the prediction. Schapire et al. proved that
larger margins on the training set translate into a superior upper bound on the
generalization error. Specifically, the generalization error is at most

$$\hat{\Pr}\left[\mathrm{margin}_f(x, y) \le \theta\right] + \tilde{O}\left(\sqrt{\frac{d}{m\theta^2}}\right)$$

for any $\theta > 0$ with high probability. Note that this bound is entirely independent
of $T$, the number of rounds of boosting. In addition, Schapire et al. proved that
boosting is particularly aggressive at reducing the margin (in a quantifiable
sense) since it concentrates on the examples with the smallest margins (whether
positive or negative). Boosting’s effect on the margins can be seen empirically,
for instance, on the right side of Fig. 2 which shows the cumulative distribution
of margins of the training examples on the “letter” dataset. In this case, even
after the training error reaches zero, boosting continues to increase the margins
of the training examples effecting a corresponding drop in the test error.



[Figure 2: left panel, error versus number of rounds; right panel, cumulative distribution versus margin.]

Fig. 2. Error curves and the margin distribution graph for boosting C4.5 on the letter
dataset as reported by Schapire et al. Left: the training and test error curves
(lower and upper curves, respectively) of the combined classifier as a function of the
number of rounds of boosting. The horizontal lines indicate the test error rate of the
base classifier as well as the test error of the final combined classifier. Right: The
cumulative distribution of margins of the training examples after 5, 100 and 1000
iterations, indicated by short-dashed, long-dashed (mostly hidden) and solid curves,
respectively.
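The margin distribution discussed above can be computed directly from the weights and the weak hypotheses' predictions. The sketch below is illustrative only: it assumes the predictions are available as a table `predictions[t][i]` holding $h_t(x_i)$ with values in $[-1, +1]$, it normalizes by the sum of the absolute weights, and the function names are hypothetical.

```python
def normalized_margins(alphas, predictions, labels):
    """Margin of each training example under the weighted vote.

    predictions[t][i] is h_t(x_i); labels[i] is y_i in {-1, +1}.
    """
    total = sum(abs(a) for a in alphas)
    margins = []
    for i, y in enumerate(labels):
        f = sum(a * h[i] for a, h in zip(alphas, predictions))
        margins.append(y * f / total)
    return margins

def margin_cdf(margins, theta):
    """Empirical fraction of examples whose margin is at most theta."""
    return sum(1 for mg in margins if mg <= theta) / len(margins)
```

Evaluating `margin_cdf` over a grid of thresholds gives a cumulative margin distribution of the kind plotted in the margin graphs, and `margin_cdf(margins, 0)` bounds the training error from above.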

Attempts (not always successful) to use the insights gleaned from the theory
of margins have been made by several authors. In addition, the margin
theory points to a strong connection between boosting and the support-vector
machines of Vapnik and others, which explicitly attempt to maximize the
minimum margin.



A connection to game theory 



The behavior of AdaBoost can also be understood in a game-theoretic setting
as explored by Freund and Schapire (see also Grove and Schuurmans
and Breiman). In classical game theory, it is possible to put any two-person,
zero-sum game in the form of a matrix $M$. To play the game, one player chooses
a row $i$ and the other player chooses a column $j$. The loss to the row player
(which is the same as the payoff to the column player) is $M_{ij}$. More generally,
the two sides may play randomly, choosing distributions $P$ and $Q$ over rows or
columns, respectively. The expected loss then is $P^{\top} M Q$.

Boosting can be viewed as repeated play of a particular game matrix. Assume
that the weak hypotheses are binary, and let $\mathcal{H} = \{h_1, \ldots, h_n\}$ be the entire weak
hypothesis space (which we assume for now to be finite). For a fixed training set
$(x_1, y_1), \ldots, (x_m, y_m)$, the game matrix $M$ has $m$ rows and $n$ columns where

$$M_{ij} = \begin{cases} 1 & \text{if } h_j(x_i) = y_i \\ 0 & \text{otherwise.} \end{cases}$$

The row player now is the boosting algorithm, and the column player is the
weak learner. The boosting algorithm’s choice of a distribution $D_t$ over training
examples becomes a distribution $P$ over rows of $M$, while the weak learner’s
choice of a weak hypothesis $h_t$ becomes the choice of a column $j$ of $M$.
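To make the correspondence concrete, the following sketch builds the game matrix for a small, finite set of weak hypotheses and evaluates the expected loss $P^{\top} M Q$. It is a minimal illustration with hypothetical function names; hypotheses are assumed to be plain Python callables returning $-1$ or $+1$. When $Q$ puts all its mass on one column, the expected loss to the row player equals that hypothesis's accuracy under the distribution $P$, that is, one minus its weighted error.

```python
def game_matrix(hypotheses, X, y):
    """M[i][j] = 1 if hypothesis j classifies example i correctly, else 0."""
    return [[1 if h(x) == label else 0 for h in hypotheses]
            for x, label in zip(X, y)]

def expected_loss(P, M, Q):
    """Expected loss P^T M Q for mixed strategies P (rows) and Q (columns)."""
    return sum(P[i] * M[i][j] * Q[j]
               for i in range(len(P)) for j in range(len(Q)))
```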

As an example of the connection between boosting and game theory, consider
von Neumann’s famous minmax theorem which states that

$$\max_Q \min_P P^{\top} M Q = \min_P \max_Q P^{\top} M Q$$

for any matrix $M$. When applied to the matrix just defined and reinterpreted
in the boosting setting, this can be shown to have the following meaning: If,
for any distribution over examples, there exists a weak hypothesis with error
at most $1/2 - \gamma$, then there exists a convex combination of weak hypotheses
with a margin of at least $2\gamma$ on all training examples. AdaBoost seeks to find
such a final hypothesis with high margin on all examples by combining many
weak hypotheses; so in a sense, the minmax theorem tells us that AdaBoost
at least has the potential for success since, given a “good” weak learner, there
must exist a good combination of weak hypotheses. Going much further,
AdaBoost can be shown to be a special case of a more general algorithm for playing
repeated games, or for approximately solving matrix games. This shows that,
asymptotically, the distribution over training examples as well as the weights
over weak hypotheses in the final hypothesis have game-theoretic interpretations
as approximate minmax or maxmin strategies.



Estimating probabilities

Classification generally is the problem of predicting the label $y$ of an example $x$
with the intention of minimizing the probability of an incorrect prediction. However,
it is often useful to estimate the probability of a particular label. Recently,
Friedman, Hastie and Tibshirani suggested a method for using the output of
AdaBoost to make reasonable estimates of such probabilities. Specifically, they
suggest using a logistic function, and estimating

$$\Pr_f[y = +1 \mid x] = \frac{e^{f(x)}}{e^{f(x)} + e^{-f(x)}} \qquad (4)$$

where, as usual, $f(x)$ is the weighted average of weak hypotheses produced by
AdaBoost. The rationale for this choice is the close connection between the log
loss (negative log likelihood) of such a model, namely,

$$\sum_i \ln\left(1 + e^{-2 y_i f(x_i)}\right) \qquad (5)$$

and the function which, we have already noted, AdaBoost attempts to minimize:

$$\sum_i e^{-y_i f(x_i)}. \qquad (3)$$

Specifically, it can be verified that Eq. (5) is upper bounded by Eq. (3). In
addition, if we add the constant $1 - \ln 2$ to Eq. (5) (which does not affect its
minimization), then it can be verified that the resulting function and the one in
Eq. (3) have identical Taylor expansions around zero up to second order; thus,
their behavior near zero is very similar. Finally, it can be shown that, for any
distribution over pairs $(x, y)$, the expectations

$$E\left[\ln\left(1 + e^{-2 y f(x)}\right)\right] \quad \text{and} \quad E\left[e^{-y f(x)}\right]$$

are minimized by the same function $f$, namely,

$$f(x) = \frac{1}{2} \ln\left(\frac{\Pr[y = +1 \mid x]}{\Pr[y = -1 \mid x]}\right).$$

Thus, for all these reasons, minimizing Eq. (3), as is done by AdaBoost, can be
viewed as a method of approximately minimizing the negative log likelihood given
in Eq. (5). Therefore, we may expect Eq. (4) to give a reasonable probability
estimate.
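The three facts used in this argument are easy to check numerically for a single example. The sketch below is an illustration with hypothetical function names: it writes both losses as functions of $z = y f(x)$, implements the logistic estimate, and computes the expected losses at a point where the true conditional probability of the label $+1$ is `p`.

```python
import math

def exp_loss(z):
    """Exponential loss exp(-z), with z = y * f(x)."""
    return math.exp(-z)

def log_loss(z):
    """Log loss ln(1 + exp(-2z)) of the logistic model, with z = y * f(x)."""
    return math.log(1 + math.exp(-2 * z))

def prob_positive(f):
    """Logistic estimate of Pr[y = +1 | x] from the combined score f(x)."""
    return math.exp(f) / (math.exp(f) + math.exp(-f))

def expected_losses(p, f):
    """Expected (exp loss, log loss) when Pr[y = +1 | x] = p and f(x) = f."""
    e = p * exp_loss(f) + (1 - p) * exp_loss(-f)
    l = p * log_loss(f) + (1 - p) * log_loss(-f)
    return e, l
```

With `p = 0.8`, both expected losses are smallest at `f = 0.5 * math.log(0.8 / 0.2)`, and `prob_positive` applied to that value recovers 0.8.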

Friedman, Hastie and Tibshirani also make other connections between
AdaBoost, logistic regression and additive models.



Multiclass classification 

There are several methods of extending AdaBoost to the multiclass case. The
most straightforward generalization, called AdaBoost.M1, is adequate when
the weak learner is strong enough to achieve reasonably high accuracy, even
on the hard distributions created by AdaBoost. However, this method fails if
the weak learner cannot achieve at least 50% accuracy when run on these hard
distributions.

For the latter case, several more sophisticated methods have been developed.
These generally work by reducing the multiclass problem to a larger binary
problem. Schapire and Singer’s algorithm AdaBoost.MH works by creating a
set of binary problems, for each example $x$ and each possible label $y$, of the form:
“For example $x$, is the correct label $y$ or is it one of the other labels?” Freund
and Schapire’s algorithm AdaBoost.M2 (which is a special case of Schapire
and Singer’s AdaBoost.MR algorithm) instead creates binary problems, for
each example $x$ with correct label $y$ and each incorrect label $y'$ of the form: “For
example $x$, is the correct label $y$ or $y'$?”

These methods require additional effort in the design of the weak learning
algorithm. A different technique, which incorporates Dietterich and Bakiri’s
method of error-correcting output codes, achieves similar provable bounds
to those of AdaBoost.MH and AdaBoost.M2, but can be used with any weak
learner which can handle simple, binary labeled data. Schapire and Singer
give yet another method of combining boosting with error-correcting output
codes.



[Figure 3: two scatterplots; horizontal axes: boosting stumps (left) and boosting C4.5 (right).]

Fig. 3. Comparison of C4.5 versus boosting stumps and boosting C4.5 on a set of
27 benchmark problems as reported by Freund and Schapire. Each point in each
scatterplot shows the test error rate of the two competing algorithms on a single
benchmark. The y-coordinate of each point gives the test error rate (in percent) of C4.5 on
the given benchmark, and the x-coordinate gives the error rate of boosting stumps (left
plot) or boosting C4.5 (right plot). All error rates have been averaged over multiple
runs.



Experiments and applications 

AdaBoost has been tested empirically by many researchers.
For instance, Freund and Schapire tested AdaBoost on a set of
UCI benchmark datasets using C4.5 as a weak learning algorithm, as
well as an algorithm which finds the best “decision stump” or single-test decision
tree. Some of the results of these experiments are shown in Fig. 3. As can be
seen from this figure, even boosting the weak decision stumps can usually give
as good results as C4.5, while boosting C4.5 generally gives the decision-tree
algorithm a significant improvement in performance.

In another set of experiments, Schapire and Singer used boosting for
text categorization tasks. For this work, weak hypotheses were used which test
on the presence or absence of a word or phrase. Some results of these experiments
comparing AdaBoost to four other methods are shown in Fig. 4. In nearly all
of these experiments and for all of the performance measures tested, boosting
performed as well as or significantly better than the other methods tested.



Reinforcement learning (RL) concerns the problem of a learning agent
interacting with its environment to achieve a goal. Instead of being given examples
of desired behavior, the learning agent must discover by trial and error how to
behave in order to get the most reward. The environment is a Markov decision
process (MDP) with state set, $S$, and action set, $A$. The agent and the environment
interact in a sequence of discrete steps, $t = 0, 1, 2, \ldots$ The state and action
at one time step, $s_t \in S$ and $a_t \in A$, determine the probability distribution for
the state at the next time step, $s_{t+1} \in S$, and, jointly, the distribution for the
next reward, $r_{t+1} \in \mathbb{R}$. The agent’s objective is to choose each $a_t$ to maximize the
subsequent return:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+1+k},$$

where the discount rate, $0 \le \gamma \le 1$, determines the relative weighting of
immediate and delayed rewards. In some environments, the interaction consists of
a sequence of episodes, each starting in a given state and ending upon arrival
in a terminal state, terminating the series above. In other cases the interaction
is continual, without interruption, and the sum may have an infinite number of
terms (in which case we usually assume $\gamma < 1$). Infinite horizon cases with $\gamma = 1$
are also possible though less common (e.g., see Mahadevan, 1996).
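For an episodic interaction, the return defined above is a finite sum and can be computed from the reward sequence with a single backward pass. The helper below is a minimal sketch; the name `discounted_return` is hypothetical, and `rewards[k]` is assumed to hold $r_{t+1+k}$.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With `gamma = 1` this is simply the sum of the rewards; smaller values of `gamma` weight immediate rewards more heavily than delayed ones.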

The agent’s action choices are a stochastic function of the state, called a
policy, $\pi : S \mapsto \Pr(A)$. The value of a state given a policy is the expected return
starting from that state following the policy:

$$V^{\pi}(s) = E\{R_t \mid s_t = s, \pi\},$$

and the best that can be done in a state is its optimal value:

$$V^{*}(s) = \max_{\pi} V^{\pi}(s).$$

There is always at least one optimal policy, $\pi^*$, that achieves this maximum
at all states $s \in S$. Paralleling the two state-value functions defined above are
two action-value functions, $Q^{\pi}(s, a) = E\{R_t \mid s_t = s, a_t = a, \pi\}$ and
$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$. From $Q^*$ one can determine an optimal deterministic policy,
$\pi^*(s) = \arg\max_a Q^*(s, a)$. For this reason, many RL algorithms focus on
approximating $Q^*$. For example, one-step tabular Q-learning (Watkins, 1989) maintains
a table of estimates $Q(s, a)$ for each pair of state and action. Whenever $a$ is taken
in $s$, $Q(s, a)$ is updated based on the resulting next state, $s'$, and reward, $r$:

$$Q(s, a) \leftarrow (1 - \alpha_{sa}) Q(s, a) + \alpha_{sa}\left[r + \gamma \max_{a'} Q(s', a')\right], \qquad (1)$$

where $\alpha_{sa} > 0$ is a time-dependent step-size parameter. Under minimal technical
conditions, $Q$ converges asymptotically to $Q^*$, from which an optimal policy can
be determined as described above (Watkins and Dayan, 1992).
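The one-step tabular Q-learning update can be written as a short function. This is a sketch under stated assumptions rather than a reference implementation: the table is assumed to be a dictionary keyed by `(state, action)` pairs with missing entries treated as zero, the step size is passed in by the caller, and the `terminal` flag (which drops the bootstrap term when the episode has ended) is an addition of this sketch.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """Apply one tabular Q-learning update to Q[(s, a)] and return the new value."""
    if terminal:
        target = r                      # no successor value after a terminal step
    else:
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]
```

A typical table would be created with `Q = defaultdict(float)`, so that every unseen state-action pair starts with an estimate of zero.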

Modern RL encompasses a wide range of problems and algorithms, of which 
the above is only the simplest case. For example, all the large applications of RL 
use not tables but parameterized function approximators such as neural networks 
(e.g., Tesauro, 1995; Crites and Barto, 1996; Singh and Bertsekas, 1997). It is 
also commonplace to consider planning — the computation of an optimal policy 
given a model of the environment — as well as learning (e.g., Moore and Atkeson, 
1993; Singh, 1993). RL can also be used when the state is not completely
observable (e.g., Loch and Singh, 1998). The methods that are effectively used in
practice go far beyond what can be proven reliable or efficient. In this sense, the
open theoretical questions in RL are legion. Here I highlight four that seem
particularly important, pressing, or opportune. The first three are basic questions
in RL that have remained open despite some attention by skilled mathematicians.
Solving these is probably not just a simple matter of applying existing
results; some new mathematics may be needed. The fourth open question
concerns recent progress in extending the theory of uniform convergence and VC
dimension to RL. For additional general background on RL, I recommend our
recent textbook (Sutton and Barto, 1998). 

1 Control with Function Approximation 

An important subproblem within many RL algorithms is that of approximating
$Q^{\pi}$ or $V^{\pi}$ for the policy $\pi$ used to generate the training experience. This is called
the prediction problem to distinguish it from the control problem of RL as a
whole (finding $Q^*$ or $\pi^*$). For the prediction problem, the use of generalizing
function approximators such as neural networks is relatively well understood.
In the strongest result in this area, the TD($\lambda$) algorithm with linear function
approximation has been proven asymptotically convergent to within a bounded
expansion of the minimum possible error (Tsitsiklis and Van Roy, 1997). In
contrast, the extension of Q-learning to linear function approximation has been
shown to be unstable (divergent) in the prediction case (Baird, 1995). This pair
of results has focused attention on Sarsa($\lambda$), the extension of TD($\lambda$) to form a
control algorithm.

Empirically, linear Sarsa($\lambda$) seems to perform well despite (in many cases)
never converging in the conventional sense. The parameters of the linear function
can be shown to have no fixed point in expected value. Yet neither do they
diverge; they seem to “chatter” in the neighborhood of a good policy (Bertsekas
and Tsitsiklis, 1996). This kind of solution can be completely satisfactory in
practice, but can it be characterized theoretically? What can be assured about
the quality of the chattering solution? New mathematical tools seem necessary.
Linear Sarsa($\lambda$) is thus both critical to the success of the RL enterprise and
greatly in need of new learning theory.



2 Monte Carlo Control 

An important dimension along which RL methods differ is their degree of
bootstrapping. For example, one-step Q-learning bootstraps its estimate for $Q(s, a)$
upon its estimates for $Q(s', a')$ (see Eq. (1)), that is, it builds its estimates upon
themselves. Non-bootstrapping methods, also known as Monte Carlo methods,
use only actual returns, with no estimates, as their basis for updating other
estimates. The $\lambda$ in methods such as TD($\lambda$), Q($\lambda$), and Sarsa($\lambda$) refers to this
dimension, with $\lambda = 0$ (as in TD(0)) representing the most extreme form of
bootstrapping, and $\lambda = 1$ representing no bootstrapping (Monte Carlo methods).

In most respects, the theory of Monte Carlo methods is better developed
than that of bootstrapping methods. Without the self reference of bootstrapping,
Monte Carlo methods are easier to analyze and closer to classical methods.
In linear prediction, for example, Monte Carlo methods have the best asymptotic
convergence guarantees. For the control case, however, results exist only
for extreme bootstrapping methods, notably tabular Q(0) and tabular Sarsa(0).
For any value of $\lambda > 0$ there are no convergence results for the control case. This
lacuna is particularly glaring and galling for the simplest Monte Carlo algorithm,
Monte Carlo ES (Sutton and Barto, 1998). This tabular method maintains
$Q(s, a)$ as the average of all completed returns (we assume an episodic interaction)
that started with taking action $a$ in state $s$. Actions are selected greedily,
$\pi(s) = \arg\max_a Q(s, a)$, while exploration is assured by assuming exploring
starts (ES), that is, that episodes start in randomly selected state-action pairs
with all pairs having a positive probability of being selected. It is hard to imagine
any RL method simpler or more likely to converge than this, yet there remains no
proof of asymptotic convergence to $Q^*$. While this simplest case remains open
we are unlikely to make progress on any control method for $\lambda > 0$.
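The method just described is short enough to sketch in full. The code below is an illustrative reading of the description, not the textbook's pseudocode: it averages the return following every occurrence of a state-action pair, it assumes a hypothetical environment function `step(s, a, rng)` returning `(next_state, reward, done)`, and it caps episode length with `max_steps` as a guard of this sketch against greedy policies that never terminate.

```python
import random
from collections import defaultdict

def monte_carlo_es(states, actions, step, episodes, gamma=1.0, seed=0, max_steps=100):
    """Tabular Monte Carlo control with exploring starts (illustrative sketch)."""
    rng = random.Random(seed)
    Q = defaultdict(float)       # running average of returns per (state, action)
    counts = defaultdict(int)    # number of returns averaged so far
    for _ in range(episodes):
        # Exploring start: every state-action pair can begin an episode.
        s, a = rng.choice(states), rng.choice(actions)
        trajectory = []
        for _ in range(max_steps):
            s_next, r, done = step(s, a, rng)
            trajectory.append((s, a, r))
            if done:
                break
            s = s_next
            a = max(actions, key=lambda b: Q[(s, b)])   # greedy action selection
        # Work backwards so g is the return that followed each (s, a).
        g = 0.0
        for s, a, r in reversed(trajectory):
            g = r + gamma * g
            counts[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / counts[(s, a)]
    return Q
```

On small deterministic tasks this sketch settles on the optimal action values quickly; the open question in the text is whether such convergence can be proved in general.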



3 Efficiency of Bootstrapping 

Perhaps the single most important new idea in the field of RL is that of
temporal-difference (TD) learning with bootstrapping. Bootstrapping TD methods have
been shown empirically to learn substantially more efficiently than Monte Carlo
methods. For example, Figure 1 presents a collection of empirical results in which
$\lambda$ was varied from 0 (pure bootstrapping) to 1 (no bootstrapping, Monte Carlo).
In all cases, performance at 0 was better than performance at 1, and the best
performance was at an intermediate value of $\lambda$. Similar results have been shown
analytically (Singh and Dayan, 1998), but again only for particular tasks and
initial settings. Thus, we have a range of results that suggest that bootstrapping
TD methods are generally more efficient than Monte Carlo methods, but no
definitive proof. While it remains unclear exactly what should or could be proved
here, it is clear that this is a key open question at the heart of current and future
RL.



[Figure 1: four panels plotting performance against $\lambda$ (horizontal axis, 0 to 1). Panel titles: Mountain Car, Puddle World, Random Walk (vertical axis: RMS error), and Cart and Pole (vertical axis: failures per 100,000 steps).]

Fig. 1. The effect of $\lambda$ on RL performance. In all cases, the better the
performance, the lower the curve. The two left panels are applications to simple
continuous-state control tasks using the Sarsa($\lambda$) algorithm and tile coding, with
either replacing or accumulating traces (Sutton, 1996). The upper-right panel
is for policy evaluation on a random walk task using TD($\lambda$) (Singh and Sutton,
1996). The lower right panel is unpublished data for a pole-balancing task from
an earlier study (Sutton, 1984).



4 A VC Dimension for RL 

So far we have discussed open theoretical questions at the heart of RL that are
distant from those usually considered in computational learning theory (COLT).
This should not be surprising; new problems are likely to call for new theory. But
it is also worthwhile to try to apply existing theoretical ideas to new problems.
Recently, some progress has been made in this direction by Kearns, Mansour
and Ng (in prep.) that seems to open up a whole range of new possibilities for
applying COLT ideas to RL.

Recall the classic COLT problem defined by a hypothesis space $\mathcal{H}$ of functions
from $X$ to $Y$ together with a probability distribution $\mathcal{D}$ on $X \times Y$. Given a training
set of $x, y$ pairs chosen according to $\mathcal{D}$, the objective is to find a function $h \in \mathcal{H}$
that minimizes the generalization error. A basic result establishes the number
of examples (on the order of the VC dimension of $\mathcal{H}$) necessary to assure with
high probability that the generalization error is approximately the same as the
training error.

Kearns, Mansour and Ng consider a closely related planning problem in RL.
Corresponding to the set of possible functions $\mathcal{H}$, they consider a set of
possible policies $\Pi$. For example, $\Pi$ could be all the greedy policies formed by
approximating an action-value function with a neural network of a certain size.
Corresponding to the probability distribution $\mathcal{D}$ on $X \times Y$, Kearns et al. use a
generative or sample model of the MDP. Given any state $s$ and action $a$, the
model generates samples of the next state $s'$ and the expected value of the next
reward $r$, given $s$ and $a$. They also allow the possibility that the environment is a
partially observable (PO) MDP, in which case the model also generates a sample
observation $o$, which alone is used by policies to select actions. Corresponding to
the classical objective of finding an $h \in \mathcal{H}$ that minimizes generalization error,
they seek a policy $\pi \in \Pi$ that maximizes performance on the (PO)MDP.
Performance here is defined as the value, $V^{\pi}(s_0)$, of some designated start state, $s_0$
(or, equivalently, on a designated distribution of starting states).

But what corresponds in the RL case to the training set of example x, y pairs? 
A key property of the conventional training set is that one such set can be reused 
to evaluate the accuracy of any hypothesis. But in the RL case different policies 
give rise to different action choices and thus to different parts of the state space 
being encountered. How can we construct a training set with a reuse property 
comparable to the supervised case? Kearns et al.’s answer is the trajectory tree, 
a tree of sample transitions starting at the start state and branching down along 
all possible action choices. For each action they obtain one sample next state 
and the expected reward from the generative model. They then recurse from 
these states, considering for each all possible actions and one sample outcome. 
They continue in this way for a sufficient depth, or horizon, $H$, such that $\gamma^H$ is
sufficiently small with respect to the target regret, $\epsilon$. If there are two possible
actions, then one such tree is of size $2^H$, which is independent of the number
of states in the (PO)MDP. The reuse property comes about because a single
tree specifies a length $H$ sample trajectory for any policy by working down the
tree following the actions taken by that policy. A tree corresponds to a single
example in the classic supervised problem, and a set of trees corresponds to a
training set of examples.
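The reuse property can be illustrated with a small sketch. The two-state generative model `toy_model`, the policies, and all function names below are hypothetical and serve only as illustration; the sketch also assumes a fully observable problem, so policies read the state directly rather than an observation.

```python
import random

def build_tree(state, depth, actions, sample_model, rng):
    """Trajectory tree: one sampled transition per action at every node."""
    if depth == 0:
        return {"state": state, "children": {}}
    children = {}
    for a in actions:
        next_state, reward = sample_model(state, a, rng)  # a single sample per action
        subtree = build_tree(next_state, depth - 1, actions, sample_model, rng)
        children[a] = (reward, subtree)
    return {"state": state, "children": children}

def evaluate(tree, policy, gamma):
    """Discounted return of the one trajectory that `policy` traces through the tree."""
    total, discount, node = 0.0, 1.0, tree
    while node["children"]:
        reward, child = node["children"][policy(node["state"])]
        total += discount * reward
        discount *= gamma
        node = child
    return total

def count_edges(tree):
    """Number of sampled transitions stored in the tree."""
    return sum(1 + count_edges(child) for _, child in tree["children"].values())

# Hypothetical generative model: returns a sampled next state and the expected reward.
def toy_model(state, action, rng):
    next_state = action if rng.random() < 0.9 else 1 - action
    expected_reward = 0.9 if action == 1 else 0.1
    return next_state, expected_reward

rng = random.Random(0)
horizon = 4
trees = [build_tree(0, horizon, (0, 1), toy_model, rng) for _ in range(50)]
policies = {"always_0": lambda s: 0, "always_1": lambda s: 1}
# The same set of trees is reused to score every policy.
estimates = {name: sum(evaluate(t, pi, 0.9) for t in trees) / len(trees)
             for name, pi in policies.items()}
```

With two actions and horizon 4, each tree stores 2 + 4 + 8 + 16 sampled transitions regardless of how many states the model has, and any number of policies can be scored against the same trees.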

With the trajectory tree construction, Kearns et al. are able to extend basic
results of uniform convergence. The conventional definition of VC dimension
cannot be directly applied to policy sets $\Pi$, but by going back to the original
definitions they establish a natural extension of it. They prove that with (on
order of) this number of trajectory trees, with probability $\delta$, one can be assured
of finding a policy whose value is within $\epsilon$ of the best policy in $\Pi$.

Kearns, Mansour and Ng’s work breaks fertile new ground in the theory 
of RL, but it is far from finishing the story. Their work could be extended in 
many different directions just as uniform convergence theory for the supervised 
case has been elaborated. For example, one could establish the VC dimension 
on some policy classes of practical import, or extend boosting ideas to the RL 
case. Alternatively, one could propose replacements for the supervised training 
examples other than trajectory trees. Kearns et al. consider how trajectories 
from random policies can be used for this purpose, and there are doubtless other 
possibilities as well. 
Abstract. AdaBoost is a popular and effective leveraging procedure
for improving the hypotheses generated by weak learning algorithms. 
AdaBoost and many other leveraging algorithms can be viewed as per- 
forming a constrained gradient descent over a potential function. At each 
iteration the distribution over the sample given to the weak learner is 
the direction of steepest descent. We introduce a new leveraging algo- 
rithm based on a natural potential function. For this potential function, 
the direction of steepest descent can have negative components. There- 
fore we provide two transformations for obtaining suitable distributions 
from these directions of steepest descent. The resulting algorithms have 
bounds that are incomparable to AdaBoost’s, and their empirical performance
is similar to AdaBoost’s.



1 Introduction 

Algorithms like AdaBoost |Z| that are able to improve the hypotheses generated 
by weak learning methods have great potential and practical benefits. We call 
any such algorithm a leveraging algorithm, as it leverages the weak learning 
method. Other examples of leveraging algorithms include bagging |3|, arc-x4 jS|, 
and LogitBoost |B|. 

One class of leveraging algorithms follows the following template to construct 
master hypotheses from a given sample $(x_1, y_1), \ldots, (x_m, y_m)$.

The leveraging algorithm begins with a default master hypothesis $H_0$
and then for $t = 1, 2, \ldots, T$ it:

— Constructs a distribution $D_t$ over the sample (as a function of the
sample and the current master hypothesis $H_{t-1}$, and possibly $t$).

— Trains a weak learner using distribution $D_t$ over the sample to obtain
a weak hypothesis $h_t$.

— Picks $\alpha_t$ and creates the new master hypothesis,

$H_t = H_{t-1} + \alpha_t h_t$.
This is essentially the Arcing paradigm introduced by Breiman and the
skeleton of AdaBoost and other boost-by-resampling algorithms. Although
leveraging algorithms include arcing algorithms following this template, leveraging
algorithms are more general. In Section 2 we introduce the GeoLev algorithm
that changes the examples in the sample as well as the distribution over them.
In this paper we consider 2-class classification problems where each $y_i \in \{-1, +1\}$.
However, following Schapire and Singer, we allow the weak learner’s hypotheses
to be “confidence rated,” mapping the domain $\mathcal{X}$ to the real numbers. The
sign of these numbers gives the predicted label, and the magnitude is a mea- 
sure of confidence. The master hypotheses produced by the above template are 
interpreted in the same way. 

Although the underlying goal is to produce hypotheses that generalize well, 
we focus on how quickly the leveraging algorithm decreases the sample error. 
There are a variety of results bounding the generalization error in terms of the 
performance on the sample Hi- 

Given a sample $(x_1, y_1), \ldots, (x_m, y_m)$, the margin of a hypothesis $h$ on instance
$x_i$ is $y_i h(x_i)$ and the margin of $h$ on the entire sample is the vector
$(y_1 h(x_1), \ldots, y_m h(x_m))$. A hypothesis that correctly labels the sample has a margin vector
whose components are all positive. Focusing on these margin vectors provides a 
geometric intuition about the leveraging problem. 

In particular, a potential function on margin space can be used to guide the 
choices of $D_t$ and $\alpha_t$. The distribution $D_t$ is the direction of steepest descent
and $\alpha_t$ is the value that minimizes the potential of $H_{t-1} + \alpha_t h_t$. Leveraging
algorithms that can be viewed in this way perform a feasible direction descent 
on the potential function. An amortized analysis using this potential function can 
often be used to bound the number of iterations required to achieve zero sample 
error. These potential functions give insight into the strengths and weaknesses 
of various leveraging algorithms. 

Boosting algorithms have the property that they can convert weak PAC learning
algorithms into strong PAC learning algorithms. Although the theory behind
the AdaBoost algorithm is very elegant, it leads to the somewhat intriguing
result that minimizing the normalization factor of a distribution will reduce the 
training error CUSI. Our search for a better understanding of how AdaBoost 
reduces the sample error led to our geometric algorithms, GeoLev and GeoArc. 
Although the performance bounds for these algorithms are too poor to show that 
they have the boosting property, these bounds are incomparable to AdaBoost’s 
in that they are better when the weak hypotheses contain mostly low-confidence 
predictions. 

The main contributions of this paper are as follows: 



— We use a natural potential function to derive a new algorithm for leveraging 
weak learners, called GeoLev (for Geometric Leveraging algorithm). 

— We highlight the relationship between AdaBoost, Arcing and feasible direc- 
tion linear programming HD|. 






Nigel Duffy and David Helmbold 



— We use our geometric interpretation to prove convergence bounds on the 
algorithm GeoLev. These bound the number of iterations taken by GeoLev 
to achieve $\epsilon$ classification error on the training set.

— We provide a general transformation from GeoLev to an arcing algorithm 
GeoArc, for which the same bounds hold.

— We summarize some preliminary experiments with GeoLev and GeoArc.



2 GeoLev - A Geometric Algorithm 



We motivate a novel algorithm, GeoLev, by considering the geometry of “margin 
space.” Since many empirical and analytical results show that good margins on 
the sample lead to small generalization error PIEj, it is natural to seek a master 
hypothesis with large margins. One heuristic is to seek a margin vector with 
uniformly large margins, i.e. a vector parallel to $\mathbf{1} = (1, 1, \ldots, 1)$. This indicates
that the master hypothesis is correct and equally confident on every instance in 
the sample. The GeoLev algorithm exploits this heuristic by attempting to find 
hypotheses whose margin vectors are as close as possible to the 1 direction. 

We now focus on a single iteration of the leveraging process, dropping the time 
subscripts. Margin vectors will be printed in bold face and often normalized to 
have Euclidean length one. Thus $\mathbf{H}$ is the margin vector of the master hypothesis
$H$, whose $i$th component is

$\mathbf{H}_i = \frac{y_i H(x_i)}{\sqrt{\sum_{j=1}^{m} (y_j H(x_j))^2}}$ .   (1)



Let the goal vector, $\mathbf{g} = (1/\sqrt{m}, \ldots, 1/\sqrt{m})$, be $\mathbf{1}$ normalized to length one.

Recall that $m$ is the sample size, so all margin vectors lie in $\mathbb{R}^m$, and normalized
margin vectors lie on the $m$ dimensional unit sphere. Note that it is easy to
re-scale the confidences - multiplying the predictions of any hypothesis $H$ by a
positive constant does not change the direction of $H$’s margin vector. Therefore we can
assume the appropriate normalization without loss of generality.

The first decision taken by the leverager is what distribution D to place on 
the sample. Since distribution D has m components, it can also be viewed as a 
(non-negative) vector in $\mathbb{R}^m$.

The situation in margin-space at the start of the iteration is shown in Figure 1.
In order to decrease the angle $\theta$ between H and g we must move the head of H
towards g. All vectors at angle $\theta$ to the goal vector g lie on a cone, and their
normalizations lie on the “rim” shown in the figure. 

If h, the weak hypothesis’s margin vector (which need not have unit length), is 
parallel to H or tangent to the “rim”, then no addition of h to H can decrease
the angle to g. On the other hand, if the line $\mathbf{H} + \alpha\mathbf{h}$ cuts through the cone,
then the angle to the goal vector g can be reduced by adding some multiple of 
h to H. The only time the angle to g cannot be decreased is when the h vector 
lies in the plane P which is tangent to the cone and contains the vector H, as 
shown in Figure 2.
Fig. 1. Situation in margin space at the start of an iteration.



If the weak learner learns at all, then its hypothesis h is better than random 
guessing, so the learner’s “edge”, $E_{i \sim D}(y_i h(x_i))$, will be positive. This means
that $\mathbf{D} \cdot \mathbf{h}$ is positive, and if distribution D (viewed as a margin vector) is
perpendicular to plane P then h lies above P. Therefore the leverager is able to 
use h to reduce the angle between H and g. 

As suggested by the figures, the appropriate direction for D is 



$\mathbf{D} = \mathbf{g} - \mathbf{H}(\mathbf{g} \cdot \mathbf{H})$ .   (2)

In general neither $\|\mathbf{D}\|_1 = 1$ nor $\|\mathbf{D}\|_2 = 1$.
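As a small numerical illustration of equation (2), the sketch below computes D for a hypothetical unit-length margin vector. The helper names and the example margins are assumptions made for this illustration. It shows that D is orthogonal to H and that an example with a much larger margin than the rest can receive a negative weight.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def geolev_direction(H):
    """Return (D, g) with D = g - H (g . H) for a unit-length margin vector H."""
    m = len(H)
    g = [1.0 / math.sqrt(m)] * m
    g_dot_H = dot(g, H)
    D = [g_i - H_i * g_dot_H for g_i, H_i in zip(g, H)]
    return D, g

# Hypothetical margins: the first example is classified far more confidently.
H = normalize([8.0, 1.0, 1.0, 1.0])
D, g = geolev_direction(H)
```

Here the first component of D is negative while the others are positive, so D cannot simply be rescaled into a distribution.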

If all components of D are positive, it can be normalized to yield a distribution 
on the sample for the weak learner. However, it is possible for some components 
of D to be negative. In this case things are more complicated^1. If a component
of D is negative, then we flip both the sign of that component and the sign of
the corresponding label in the sample. This creates a new direction $\mathbf{D}'$ which
can be normalized to a distribution $D'$ and a new sample $S'$ with the same $x_i$’s
but (possibly) new labels $y'_i$. The modified sample $S'$ and distribution $D'$ are
then used to generate a new weak hypothesis, $h$. Let $\mathbf{h}'$ be the margins of $h$ on
the modified sample $S'$, so $\mathbf{h}'_i = y'_i h(x_i)$. Now,

$\mathbf{D}' \cdot \mathbf{h}' = \sum_{i=1}^{m} \mathbf{D}'_i y'_i h(x_i) = \sum_{i=1}^{m} \mathbf{D}_i y_i h(x_i) = \mathbf{D} \cdot \mathbf{h}$   (3)

^1 In fact it is this complication which differentiates GeoLev from Arcing algorithms.
Arcing algorithms are not permitted to change the sample in this way. A second
transformation avoiding the label flipping is discussed in Section 5.







Fig. 2. The direction D for the distribution used by GeoLev. 



as the sign flips cancel. 
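The cancellation in equation (3) can be checked directly. The sketch below uses hypothetical weights, labels and confidence-rated predictions; the function name `flip_negative_weights` is an assumption for this illustration.

```python
def flip_negative_weights(D, labels):
    """Flip the sign of each negative weight together with the matching label."""
    D_flipped = [abs(d) for d in D]
    labels_flipped = [-y if d < 0 else y for d, y in zip(D, labels)]
    total = sum(D_flipped)
    distribution = [d / total for d in D_flipped]   # what the weak learner sees
    return D_flipped, labels_flipped, distribution

# Hypothetical signed weighting, labels, and predictions h(x_i).
D = [-0.2, 0.5, 0.4, -0.1]
labels = [1, -1, 1, -1]
predictions = [0.7, -0.3, -0.5, 0.9]

D2, y2, dist = flip_negative_weights(D, labels)
edge_original = sum(d * y * p for d, y, p in zip(D, labels, predictions))
edge_flipped = sum(d * y * p for d, y, p in zip(D2, y2, predictions))
```

Both sums agree because each flipped weight is paired with a flipped label, so the product of the two is unchanged.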

The second decision taken by the algorithm is how to incorporate the weak 
hypothesis h into its master hypothesis H . Any weak hypothesis with an “edge” 
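on the distribution D described above can be used to decrease $\theta$. Our goal is
to find the coefficient $\alpha$ so that $\mathbf{H}' = (\mathbf{H} + \alpha\mathbf{h}) / \|\mathbf{H} + \alpha\mathbf{h}\|_2$ decreases this angle as much as
possible. Taking derivatives shows that $\theta$ is minimized when

$\alpha = \frac{\mathbf{g} \cdot (\mathbf{h} - \mathbf{H}(\mathbf{H} \cdot \mathbf{h}))}{\mathbf{g} \cdot (\mathbf{H}(\mathbf{h} \cdot \mathbf{h}) - \mathbf{h}(\mathbf{H} \cdot \mathbf{h}))}$ .   (4)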

From this discussion we can see that GeoLev performs a kind of gradient descent.
If we consider the angle between g and the current H as a potential on
margin space, then D is the direction of steepest descent. Moving in a direction 
that approximates this gradient takes us towards the goal vector. Since we have 
only little control over the hypotheses returned by the weak learner, an approxi- 
mation to this direction is the best we can do. The step size is chosen adaptively 
to make as much use of the weak hypothesis as possible. 
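A minimal sketch of the step-size rule in equation (4) follows. The example vectors are hypothetical and were chosen so that the arithmetic is easy to follow by hand; the function names are assumptions for this illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cos_to_goal(v):
    """Cosine of the angle between v and the goal direction (1, ..., 1)."""
    return sum(v) / (math.sqrt(len(v)) * math.sqrt(dot(v, v)))

def geolev_step(H, h):
    """Return (alpha, H') where H has unit length and H' is H + alpha*h renormalized."""
    m = len(H)
    g = [1.0 / math.sqrt(m)] * m
    gH, gh, Hh, hh = dot(g, H), dot(g, h), dot(H, h), dot(h, h)
    alpha = (gh - gH * Hh) / (gH * hh - gh * Hh)   # equation (4), expanded
    updated = [H_i + alpha * h_i for H_i, h_i in zip(H, h)]
    norm = math.sqrt(dot(updated, updated))
    return alpha, [u / norm for u in updated]

H = [0.5, 0.5, -0.5, 0.5]    # unit-length master margins with one mistake
h = [-1.0, 1.0, 1.0, 1.0]    # margins of a hypothetical +/-1 weak hypothesis
alpha, H_new = geolev_step(H, h)
```

For these vectors the rule gives a step of one half, and the cosine of the angle to g rises from 0.5 to about 0.707; nearby step sizes give a smaller cosine.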

The GeoLev Algorithm is summarized in Figure 3.



3 Relation to Previous Work 

Breiman [5, 4] defines arcing algorithms using potential functions that can be expressed
as component-wise functions of the margins having the form^2 $\sum_i f(\mathbf{H}_i)$.

^2 Breiman allows the component-wise potential $f$ to depend on the sum of the $\alpha_i$’s in
some arcing algorithms.
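Input: A sample $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ and
       a weak learning algorithm.

Initialize master hypothesis H to predict 0 everywhere
Set $\mathbf{g} = (1/\sqrt{m}, \ldots, 1/\sqrt{m})$

Repeat:
    $\mathbf{D} = \mathbf{g} - \mathbf{H}(\mathbf{g} \cdot \mathbf{H})$
    $S' = \emptyset$
    for i = 1 to m do
        if $\mathbf{D}_i < 0$ then
            add $(x_i, -y_i)$ to $S'$
        else
            add $(x_i, y_i)$ to $S'$
    end do
    call weak learner with distribution D over S', obtaining hypothesis h
    $\alpha = \frac{\mathbf{g} \cdot (\mathbf{h} - \mathbf{H}(\mathbf{H} \cdot \mathbf{h}))}{\mathbf{g} \cdot (\mathbf{H}(\mathbf{h} \cdot \mathbf{h}) - \mathbf{h}(\mathbf{H} \cdot \mathbf{h}))}$
    $\mathbf{H} = (\mathbf{H} + \alpha\mathbf{h}) / \sqrt{\sum_{i=1}^{m} (\mathbf{H}_i + \alpha\mathbf{h}_i)^2}$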



Fig. 3. The GeoLev Algorithm. 
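The loop in Figure 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weak learner is replaced by a hypothetical stand-in that picks the best member of a fixed pool of margin vectors, the first round (where H predicts 0 and the step formula is undefined) simply adopts the normalized weak hypothesis, and the loop stops when a denominator or edge vanishes.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def best_in_pool(pool, dist, flips):
    """Stand-in weak learner: the pool member with the largest edge on the flipped sample."""
    def edge(h):
        return sum(p * f * h_i for p, f, h_i in zip(dist, flips, h))
    best = max(pool, key=edge)
    return best, edge(best)

def geolev(pool, rounds):
    """pool holds margin vectors (y_i * h(x_i)) of candidate weak hypotheses."""
    m = len(pool[0])
    g = [1.0 / math.sqrt(m)] * m
    H = None                                  # master hypothesis predicts 0 everywhere
    history = []                              # sin^2 of the angle to g after each round
    for _ in range(rounds):
        gH = 0.0 if H is None else dot(g, H)
        D = g if H is None else [g_i - H_i * gH for g_i, H_i in zip(g, H)]
        flips = [-1.0 if d < 0 else 1.0 for d in D]   # label flips defining S'
        total = sum(abs(d) for d in D)
        if total < 1e-12:                     # H already points along g
            break
        dist = [abs(d) / total for d in D]
        h, edge = best_in_pool(pool, dist, flips)
        if edge <= 1e-12:                     # no hypothesis with a positive edge
            break
        if H is None:
            H = unit(h)                       # assumption: first round adopts h
        else:
            Hh, gh, hh = dot(H, h), dot(g, h), dot(h, h)
            denom = gH * hh - gh * Hh
            if abs(denom) < 1e-12:
                break
            alpha = (gh - gH * Hh) / denom
            H = unit([H_i + alpha * h_i for H_i, h_i in zip(H, h)])
        history.append(1.0 - dot(g, H) ** 2)
    return H, history

# Hypothetical pool: each weak hypothesis errs on exactly one of four examples.
pool = [[1.0, 1.0, 1.0, -1.0], [1.0, 1.0, -1.0, 1.0],
        [1.0, -1.0, 1.0, 1.0], [-1.0, 1.0, 1.0, 1.0]]
H, history = geolev(pool, 3)
training_error = sum(1 for margin in H if margin <= 0) / len(H)
```

On this toy pool the squared sine of the angle to g falls from 0.75 to 0.5 to 0.25, and after three rounds every margin is positive.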



Breiman shows that, under certain conditions on /, arcing algorithms converge 
to good hypotheses in the limit. Furthermore, he shows that AdaBoost is an arcing
algorithm with $f(x) = e^{-x}$ and arc-x4 is an arcing algorithm with polynomial
$f(x)$.

For completeness, we describe the AdaBoost algorithm and show in our notation 
how it is performing feasible direction gradient descent on the potential function 
$\sum_{i=1}^{m} e^{-\mathbf{H}_i}$. AdaBoost fits the template outlined in the introduction, choosing the
distribution

$D(i) = \frac{\exp(-\mathbf{H}_i)}{Z}$   (5)

where $Z$ is the normalizing factor so that $D$ sums to 1. The master hypothesis
is updated by adding in a multiple of the new weak hypothesis. The coefficient
$\alpha$ is chosen to minimize

$\sum_{i=1}^{m} \exp(-(\mathbf{H}_i + \alpha\mathbf{h}_i))$ ,   (6)

the next iteration’s $Z$ value. Unlike GeoLev, the margin vectors of AdaBoost’s
hypotheses are not normalized.

We now show that AdaBoost can be viewed as minimizing the potential

$\sum_{i=1}^{m} \exp(-\mathbf{H}_i)$   (7)
by approximate gradient descent. The direction of steepest descent (w.r.t. the
components of the margin vector) is proportional to the distribution AdaBoost
gives to the weak learner.

Continuing the analogy, the coefficient $\alpha$ given to the new hypothesis should
minimize the potential,

$\sum_i \exp(-(\mathbf{H} + \alpha\mathbf{h})_i)$ ,   (8)

of the updated master hypothesis, which is identical to (6). Thus AdaBoost’s
behavior is approximate gradient descent of the function defined in (7), where the
direction of descent is the weak learner’s hypothesis. Furthermore, the bounds on
AdaBoost’s performance proven by Schapire and Singer are implicitly performing
an amortized analysis over the potential function (8).
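The potential-function view of AdaBoost can be checked numerically. In the sketch below the margin values are hypothetical, and the closed-form step size for ±1-valued weak hypotheses is the standard AdaBoost formula rather than something stated in this section.

```python
import math

def adaboost_distribution(H):
    """Weights proportional to exp(-H_i), normalized by Z."""
    weights = [math.exp(-H_i) for H_i in H]
    Z = sum(weights)
    return [w / Z for w in weights]

def potential(H, h, alpha):
    """Exponential potential of the updated (unnormalized) margins."""
    return sum(math.exp(-(H_i + alpha * h_i)) for H_i, h_i in zip(H, h))

def alpha_for_pm1(D, h):
    """Closed-form minimizer of the potential when every h_i is +1 or -1."""
    r = sum(D_i * h_i for D_i, h_i in zip(D, h))   # edge of h under D
    return 0.5 * math.log((1 + r) / (1 - r))

H = [0.5, -0.2, 1.0, 0.0]     # hypothetical unnormalized master margins
h = [1.0, 1.0, -1.0, 1.0]     # margins of a +/-1 weak hypothesis
D = adaboost_distribution(H)
alpha = alpha_for_pm1(D, h)
```

The example with the smallest margin receives the largest weight, and the potential at the chosen step is lower than at neighbouring step sizes.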

Arc-x4 also fits the template outlined in the introduction, keeping an unnormalized
master hypothesis. In our notation the distribution chosen at trial $t$ is
proportional to

$16 + (t - \mathbf{H}_i)^4$ .   (9)

This algorithm can also be viewed as a gradient descent on the potential function 

$\sum_i \left[ -16\,\mathbf{H}_i + \tfrac{1}{5}(t - \mathbf{H}_i)^5 \right]$   (10)

i 

at the $t$th iteration. Rather than computing the coefficient $\alpha$ as a function of
the weak hypothesis, arc-x4 always chooses $\alpha = 1$. Thus each $h_t$ has weight $1/t$
in the master hypothesis, as in many gradient descent methods. Unfortunately, 
the dependence of the potential function on t makes it difficult to use in an 
amortized analysis. 

This connection to gradient descent was hinted at by Freund and noted by
Breiman and others. Our interpretation generalizes the previous work
by relaxing the constraints on the potential function. In particular, we show how 
to construct algorithms from potential functions where the direction of steepest 
descent can have negative components. The potential function view of leveraging 
algorithms shows their relationship to feasible descent linear programming, and 
this relationship provides insight into the role of the weak learner. 

Feasible direction methods try to move in the direction of steepest descent. 
However, they must remain in the feasible region described by the constraints. 
A descent direction is chosen that is closest to the (negative) gradient $-\nabla f$ while
satisfying the constraints. For example, in a simplified Zoutendijk method, the
chosen direction $d$ satisfies the constraints and maximizes $-\nabla f \cdot d$. Similarly, the
leveraging algorithms discussed are constrained to produce master hypotheses 
lying in the span of the weak learner’s hypothesis class. One can view the role 
of the weak learner as finding a feasible direction close to the given distribution 
(or negative gradient). In fact the weak learning assumption used in boosting 
and in the analysis of GeoLev implies that there is always a feasible direction d 
such that $-\nabla f \cdot d$ is bounded above zero.



A Geometric Approach to Leveraging Weak Learners 






The gradient descent framework outlined above provides a method for deriving 
the corresponding leveraging algorithm from smooth potential functions over 
margin space. 

The potential functions used by AdaBoost and arc-x4 have the advantage that 
all the components of their gradients D are positive, and thus it is easy to convert 
D into a distribution. On the other hand, the methods outlined in the previous
section and Section 5 can be used to handle gradients with negative components.
The approach used by Ratsch et al. can similarly be interpreted as a potential
function of the margins.

Recently, Friedman et al. have given a maximum likelihood motivation for AdaBoost,
and introduced another leveraging algorithm based on the log-likelihood
criteria. They indicate that minimizing the square loss potential, $(\mathbf{H} - \mathbf{1})^2$, performed
less well in experiments than other monotone potentials, and conjecture
that its non-monotonicity (penalizing margins greater than 1) is a contributing
factor. Our methods described in Section 5 may provide a way to ameliorate this
problem.



4 Convergence Bound



In this section we examine the number of iterations required by GeoLev to 
achieve classification error $\epsilon$ on the sample. The key step shows how the sine
of the angle between the goal vector g and the master hypothesis H is reduced 
each iteration. Upper bounding the resulting recurrence gives a bound on how 
rapidly the training error decreases. 

We begin by considering a single boosting iteration. The margin space quantities 
D, g, H, h, $\theta$ are as previously defined (recall that g and H are 2-normed, while
D and h are not). In addition, let H' denote the new master hypothesis at the
end of the iteration, and $\theta'$ the angle between H' and g. We assume throughout
that the sample is finite. 

Define $r = (\mathbf{D} \cdot \mathbf{h}) / \|\mathbf{D}\|_1$ to be the edge of the weak learner’s hypothesis $h$ with respect
to the distribution given to the weak learner. Our bound on the decrease in $\theta$
will depend on $h$ only through $r$ and $\|\mathbf{h}\|_2$. Note that $r$ was chosen to maintain
consistency with the work of Schapire and Singer and that, for $\pm 1$-valued $h$,

$P_{i \sim D}[h(x_i) \neq y_i] = \frac{1 - r}{2}$ .   (11)



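At the start of the iteration $\sin(\theta) = \sqrt{1 - \cos^2(\theta)} = \sqrt{1 - (\mathbf{g} \cdot \mathbf{H})^2}$, and at the
end of the iteration $\sin(\theta') = \sqrt{1 - \cos^2(\theta')} = \sqrt{1 - (\mathbf{g} \cdot \mathbf{H}')^2}$. Recall that $\mathbf{H}'$ is
$\mathbf{H} + \alpha\mathbf{h}$ normalized, and since $\mathbf{H}$ already has unit length,

$\mathbf{H}' = \frac{\mathbf{H} + \alpha\mathbf{h}}{\sqrt{1 + \alpha^2(\mathbf{h} \cdot \mathbf{h}) + 2\alpha(\mathbf{H} \cdot \mathbf{h})}}$ .   (12)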



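Lemma 1. The value $\cos^2(\theta')$ is maximized (and $\sin(\theta')$ minimized) when

$\alpha = \frac{(\mathbf{g} \cdot \mathbf{h}) - (\mathbf{g} \cdot \mathbf{H})(\mathbf{H} \cdot \mathbf{h})}{(\mathbf{g} \cdot \mathbf{H})(\mathbf{h} \cdot \mathbf{h}) - (\mathbf{g} \cdot \mathbf{h})(\mathbf{H} \cdot \mathbf{h})}$ .   (13)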
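Proof The lemma follows from examination of the first and second derivatives
of $\cos^2(\theta')$ with respect to $\alpha$. □

Using this value of $\alpha$ a little algebra shows that

$\cos^2(\theta') = \frac{(\mathbf{H} \cdot \mathbf{h})^2 (\mathbf{g} \cdot \mathbf{H})^2 - ((\mathbf{g} \cdot \mathbf{h}) - (\mathbf{g} \cdot \mathbf{H})(\mathbf{H} \cdot \mathbf{h}))^2 - (\mathbf{h} \cdot \mathbf{h})(\mathbf{g} \cdot \mathbf{H})^2}{(\mathbf{H} \cdot \mathbf{h})^2 - (\mathbf{h} \cdot \mathbf{h})}$   (14)

$= \frac{(\mathbf{H} \cdot \mathbf{h})^2 (\mathbf{g} \cdot \mathbf{H})^2 - (\mathbf{D} \cdot \mathbf{h})^2 - (\mathbf{h} \cdot \mathbf{h})(\mathbf{g} \cdot \mathbf{H})^2}{(\mathbf{H} \cdot \mathbf{h})^2 - (\mathbf{h} \cdot \mathbf{h})}$ .   (15)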



Although we desire bounds that hold for all h, we find it convenient to first
minimize (15) with respect to $(\mathbf{H} \cdot \mathbf{h})$. The remaining dependence on h will be
expressed as a function of $r$ and $\|\mathbf{h}\|_2$ in the final bound.

Lemma 2. Equation (15) is minimized when $(\mathbf{H} \cdot \mathbf{h}) = 0$.

Proof Again the lemma follows after examining the first and second derivatives
with respect to $(\mathbf{H} \cdot \mathbf{h})$. □

This considerably simplifies (15), yielding

$\cos^2(\theta') \geq (\mathbf{g} \cdot \mathbf{H})^2 + \frac{(\mathbf{D} \cdot \mathbf{h})^2}{(\mathbf{h} \cdot \mathbf{h})}$ .   (16)


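Recall that

$(\mathbf{D} \cdot \mathbf{h}) = r\,\|\mathbf{D}\|_1$ .   (17)

Therefore,

$\cos^2(\theta') \geq (\mathbf{g} \cdot \mathbf{H})^2 + \frac{r^2 \|\mathbf{D}\|_1^2}{\|\mathbf{h}\|_2^2}$ .   (18)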



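We will bound this in two ways, using different bounds on $\|\mathbf{D}\|_1$. The first of these
bounds is derived by noting that $\|\mathbf{D}\|_1 \geq \|\mathbf{D}\|_2$. Recall that $\mathbf{D} = \mathbf{g} - \mathbf{H}(\mathbf{g} \cdot \mathbf{H})$,
so $(\mathbf{D} \cdot \mathbf{D}) = (1 - (\mathbf{g} \cdot \mathbf{H})^2) = \sin^2(\theta)$. Combining this with (18) and the bound
on $\|\mathbf{D}\|_1$ yields

$\sin^2(\theta') \leq 1 - (\mathbf{g} \cdot \mathbf{H})^2 - \frac{r^2}{\|\mathbf{h}\|_2^2} \sin^2(\theta)$   (19)

$\sin(\theta') \leq \sin(\theta) \sqrt{1 - \frac{r^2}{\|\mathbf{h}\|_2^2}}$ .   (20)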

Repeated application of this bound yields the following theorem. 

Theorem 1. If $r_1, \ldots, r_T$ are the edges of the weak learner’s hypotheses during
the first $T$ iterations, then the sine of the angle between g and the margin vector
for the master hypothesis computed at iteration $T$ is at most

$\prod_{t=1}^{T} \sqrt{1 - \frac{r_t^2}{\|\mathbf{h}_t\|_2^2}}$ .
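The one-step inequality behind Theorem 1 can be spot-checked numerically. The sketch below draws hypothetical random vectors, takes the step of equation (13), and compares the squared sine after the step with the square of the right-hand side of (20). It takes the edge r to be D · h divided by the 1-norm of D, an assumption matching the normalization used in the derivation; all names are illustrative.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def check_bound(H, h):
    """Return (sin^2 after the step of (13), squared right-hand side of (20))."""
    m = len(H)
    g = [1.0 / math.sqrt(m)] * m
    gH, gh, Hh, hh = dot(g, H), dot(g, h), dot(H, h), dot(h, h)
    alpha = (gh - gH * Hh) / (gH * hh - gh * Hh)
    H_new = unit([H_i + alpha * h_i for H_i, h_i in zip(H, h)])
    D = [g_i - H_i * gH for g_i, H_i in zip(g, H)]
    r = dot(D, h) / sum(abs(d) for d in D)        # edge w.r.t. the normalized weighting
    sin2_before = 1.0 - gH ** 2
    sin2_after = 1.0 - dot(g, H_new) ** 2
    return sin2_after, sin2_before * (1.0 - r * r / hh)

rng = random.Random(1)
results = []
for _ in range(200):
    H = unit([rng.gauss(0.0, 1.0) for _ in range(6)])   # hypothetical master margins
    h = [rng.uniform(-1.0, 1.0) for _ in range(6)]      # hypothetical weak margins
    results.append(check_bound(H, h))
```

The comparison is made on squared sines because equation (13) maximizes the squared cosine; the sketch does not check the sign of the cosine after the step.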
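We can bound $\|\mathbf{D}\|_1$ another way to obtain a bound which is often better.
Note that $\|\mathbf{D}\|_1 \geq (\mathbf{D} \cdot \mathbf{1}) = \sqrt{m}(\mathbf{D} \cdot \mathbf{g}) = \sqrt{m}(1 - (\mathbf{g} \cdot \mathbf{H})^2)$. Substituting this
into (18) and continuing as before yields

$\cos^2(\theta') \geq (\mathbf{g} \cdot \mathbf{H})^2 + \frac{r^2 m}{\|\mathbf{h}\|_2^2} (1 - (\mathbf{g} \cdot \mathbf{H})^2)^2$   (21)

$\sin^2(\theta') \leq \sin^2(\theta) - \frac{r^2 m}{\|\mathbf{h}\|_2^2} \sin^4(\theta)$ .   (22)

Continuing as above results in the following theorem.

Theorem 2. Let $r_1, \ldots, r_T$ be the edges of the weak learner’s hypotheses and
$\theta_1, \ldots, \theta_T$ be the angles between g and the margins of the master hypotheses at
the start of the first $T$ iterations. If $\theta_{T+1}$ is the angle between g and the margins
of the master hypothesis produced at iteration $T$ then

$\sin(\theta_{T+1}) \leq \prod_{t=1}^{T} \sqrt{1 - \frac{r_t^2 m}{\|\mathbf{h}_t\|_2^2} \sin^2(\theta_t)}$ .   (23)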

To relate these results to the sample error we use the following lemma. 

Lemma 3. If $\sin(\theta) < \sqrt{\epsilon}$ where $\theta$ is the angle between g and a master hypothesis
H, then the sample error of H is less than $\epsilon$.

Proof Assume $\sin(\theta) < \sqrt{R/m}$, so $\cos(\theta) = \mathbf{g} \cdot \mathbf{H} > \sqrt{1 - R/m}$, and $\sum_{i=1}^{m} \mathbf{H}_i > \sqrt{m - R}$.
Since H is 2-normed, this can only hold if H has more than $m - R$
positive components. Therefore the master hypothesis correctly classifies more
than $m - R$ examples and the sample error rate is at most $(R - 1)/m$. □

Combining Lemma 3 and Theorem 2 gives the following corollary.

Corollary 1. After iteration $T$, the sample error rate of GeoLev’s master hypothesis
is bounded by

$\prod_{t=1}^{T} \left( 1 - \frac{r_t^2 m}{\|\mathbf{h}_t\|_2^2} \sin^2(\theta_t) \right)$ .   (24)

The recurrence of Theorem 2 is somewhat difficult to analyze, but we can apply
the following lemma from Abe et al.

Lemma 4. Consider a sequence $\{g_t\}$ of non-negative numbers satisfying
$g_{t+1} \leq g_t - c g_t^2$, where $c > 0$ is a positive constant. If
$f_t = \frac{1}{c \left( t + \frac{1}{g_0 c} \right)}$ then $g_t \leq f_t$ for all $t \in \mathbb{N}$.



Given a lower bound $r$ on the $r_t$ values and an upper bound $H_2$ on $\|\mathbf{h}_t\|_2$, then
we can apply this lemma to recurrence (22). Setting $g_0 = 1$, $g_t = \sin^2(\theta_t)$ and
$c = r^2 m / H_2^2$ shows that

$\sin^2(\theta_{T+1}) \leq \frac{H_2^2}{r^2 m T + H_2^2}$ .   (25)
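Lemma 4 and the closed form in (25) can be checked on the worst-case sequence, where the recurrence holds with equality. The edge, sample size and norm bound below are hypothetical values chosen for illustration, and the function names are assumptions.

```python
def lemma4_bound(g0, c, t):
    """f_t = 1 / (c * (t + 1 / (g0 * c)))."""
    return 1.0 / (c * (t + 1.0 / (g0 * c)))

def worst_case_sequence(g0, c, steps):
    """Iterate g_{t+1} = g_t - c * g_t^2, the extreme case of the recurrence."""
    seq = [g0]
    for _ in range(steps):
        g = seq[-1]
        seq.append(g - c * g * g)
    return seq

# Hypothetical setting: edge r, sample size m, squared norm bound H_2^2.
r, m, H2_squared = 0.1, 20, 20.0      # e.g. +/-1 hypotheses on 20 examples
c = r * r * m / H2_squared
seq = worst_case_sequence(1.0, c, 200)
T = 100
closed_form = H2_squared / (r * r * m * T + H2_squared)   # right-hand side of (25)
```

With these values c = 0.01 and the bound at T = 100 is one half, which the simulated sequence stays below.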
This, and the previous results lead to the following theorem.

Theorem 3. If the weak learner always returns hypotheses with an edge greater
than $r$ and $H_2$ is an upper bound on $\|\mathbf{h}_t\|_2$, then GeoLev’s hypothesis will have
at most $\epsilon$ training error after

$\min\left( \frac{H_2^2 \ln(1/\epsilon)}{r^2},\; \frac{H_2^2 (1 - \epsilon)}{r^2 m \epsilon} \right)$   (26)

iterations.



Similar bounds have been obtained by Freund and Schapire 



13 for AdaBoost. 



Theorem 4. After $T$ iterations, the sample error rate of AdaBoost’s master
hypothesis is at most

$\prod_{t=1}^{T} \sqrt{1 - \frac{r_t^2}{\|\mathbf{h}_t\|_\infty^2}}$ .   (27)



The dependence on $\|\mathbf{h}\|_\infty$ is implicit in their bounds and can be removed
when $h_t(x_i) \in [-1, +1]$.

Comparing Corollary 1 and Theorem 4 leads to the following observations. First,
the bound on GeoLev does not contain the square-root. If this were the only
difference, then it would correspond to a halving of the number of iterations
required to reach error rate $\epsilon$ on the sample. This effect can be approximated by
a factor of 2 on the $r_t^2$ terms.

A more important difference is the factors multiplying the $r_t^2$ terms. With
the preceding approximation GeoLev’s bound has $2m \sin^2(\theta_t) / \|\mathbf{h}_t\|_2^2$, while AdaBoost’s
bound has $1 / \|\mathbf{h}_t\|_\infty^2$. The larger this factor the better the bound. The
dependence on $\sin^2(\theta_t)$ means that GeoLev’s progress tapers off as it approaches
zero sample error.

If the weak hypotheses are equally confident on all examples, then $\|\mathbf{h}_t\|_2^2$ is $m$
times larger than $\|\mathbf{h}_t\|_\infty^2$ and the difference in factors is simply $2 \sin^2(\theta_t)$. At
the start of the boosting process $\theta_t$ is close to $\pi/2$ and GeoLev’s factor is larger.
However, $\sin^2(\theta_t)$ can be as small as $1/m$ before GeoLev predicts perfectly on the
sample. Thus GeoLev does not seem to gain as much from later iterations, and
this difficulty prevents us from showing that GeoLev is a boosting algorithm.
On the other hand, consider the less likely situation where the weak hypotheses
produce a confident prediction for only one sample point, and abstain on the
rest. Now $\|\mathbf{h}_t\|_2^2 = \|\mathbf{h}_t\|_\infty^2$, and GeoLev’s bound has an extra factor of about
$2m \sin^2(\theta_t)$. GeoLev’s bounds are uniformly better^3 than AdaBoost’s in this
case.

^3 We must switch to recurrence (20) rather than recurrence (23) when $\sin^2(\theta_t)$ is very
small.






5 Conversion to an Arcing Algorithm 

The GeoLev algorithm discussed so far does not fit the template for Arcing 
algorithms because it modifies the labels in the sample given to the weak learner. 
This also breaks the boosting paradigm as the weak learner may be required to 
produce a good hypothesis for data that is not consistent with any concept in 
the underlying concept class. In this section we describe a generic conversion 
that produces arcing algorithms from leveraging algorithms of this kind without 
placing an additional burden on the weak learner. Throughout this section we 
assume that the weak learner’s hypotheses produce values in $[-1, +1]$.

The conversion introduces a wrapper between the weak learner and leveraging
algorithm that replaces the sign-flip trick of Section 2. This wrapper takes the
(signed) weighting $D$ from the leveraging algorithm, and creates the distribution
$D'$ by setting all negative components to zero and re-normalizing. This modified
distribution $D'$ is then given to the weak learner, which returns a hypothesis $h$
with a margin vector $\mathbf{h}$. The margin vector is modified by the wrapper before
being passed on to the leveraging algorithm: if $D(x_i)$ is negative then $\mathbf{h}_i$ is set
to $-1$. Thus the leveraging algorithm sees a modified margin vector $\mathbf{h}'$ which it
uses to compute $\alpha$ and the margins of the new master hypothesis.

The intuition is that the leveraging algorithm is being fooled into thinking that 
the weak hypothesis is wrong on parts of the sample when it is actually cor- 
rect. Therefore the margins of the master hypothesis are actually better than 
those tracked by the leveraging algorithm. Furthermore, the apparent “edge” of 
the weak learner can only be increased by this wrapping transformation. This 
intuition is formalized in the following theorems. 

Theorem 5. If $r = \sum_i D'(x_i) y_i h(x_i) > 0$ is the edge of the weak learner with
respect to the distribution it sees, and $r' = \sum_i D(x_i) y_i h'(x_i)$ is the edge of the
modified weak hypothesis with respect to the (signed) weighting $D$ requested by
the leveraging algorithm, then $r' \geq r$.

Proof Let $S^+ = \{i : D(x_i) \geq 0\}$ and $p = \sum_{i \in S^+} D(x_i)$. The construction
ensures that both $D'(x_i) = D(x_i)/p$ if $i \in S^+$ and zero otherwise, and
$D(x_i) y_i h'(x_i) = |D(x_i)|$ for all $i \notin S^+$. Now, with $D$ scaled so that $\sum_i |D(x_i)| = 1$,

$r' = \sum_i D(x_i) y_i h'(x_i) = 1 - p + \sum_{i \in S^+} D(x_i) y_i h(x_i) = 1 - p + pr$ .   (28)

The assumption on $h$ implies $r \leq 1$, so $r'$ is minimized at $p = 1$ where $r' = r$. □

Theorem 6. No component of the master margin vector $\mathbf{H} = \sum_t \alpha_t \mathbf{h}'_t$ used by
the wrapped leveraging algorithm is ever greater than the actual margins of the
master hypothesis $\sum_t \alpha_t \mathbf{h}_t$.

Proof The theorem follows immediately by noting that each component of $\mathbf{h}'_t$
is no greater than the corresponding component of $\mathbf{h}_t$. □
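The wrapper and the two claims just proved can be illustrated on a small hypothetical example. The signed weighting below is scaled so that its absolute values sum to one, and the margins lie in [-1, +1]; both choices, and the function name `wrap`, are assumptions of this sketch.

```python
def wrap(D, margins):
    """Zero out negative weights for the weak learner and set the
    corresponding margins to -1 for the leveraging algorithm."""
    p = sum(d for d in D if d > 0)
    D_prime = [d / p if d > 0 else 0.0 for d in D]
    wrapped = [-1.0 if d < 0 else m_i for d, m_i in zip(D, margins)]
    return D_prime, wrapped, p

# Hypothetical signed weighting (absolute values sum to 1) and margins y_i h(x_i).
D = [0.4, -0.2, 0.3, -0.1]
margins = [0.5, 0.8, -0.2, -0.6]

D_prime, wrapped, p = wrap(D, margins)
r = sum(d * m_i for d, m_i in zip(D_prime, margins))       # edge the weak learner achieves
r_prime = sum(d * m_i for d, m_i in zip(D, wrapped))       # edge the leverager perceives
```

Here p = 0.7, r = 0.2 and r' = 0.44 = 1 - p + pr, and every wrapped margin is at most the true margin.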
We call the wrapped version of GeoLev, GeoArc, as it is an Arcing algorithm.
It is instructive to examine the potential function associated with GeoArc: 



This potential has a similar form to the following potential function which is 
zero on the entire positive orthant: 



The leveraging framework we have described together with this transformation 
enables the analysis of some undifferentiable potential functions. The full impli- 
cations of this remain to be explored. 

6 Preliminary Experiments 

We performed experiments comparing GeoLev and GeoArc to AdaBoost on a
set of 13 datasets (the 2 class ones used in previous experiments) from the UCI
repository. These experiments were run along the same lines as those reported
by Quinlan. We ran 10 times 10 fold cross validation on the datasets for
two class classification. All leveraging algorithms ran for 25 iterations, and used
single node decision trees as implemented in MLC++ for the weak hypotheses.
Note that these are ±1 valued hypotheses, with large 2-norms. It was noticed
that the splitting criterion used for the single node had a large impact on the
results. Therefore, the results reported for each dataset are those for the better of
mutual information ratio and gain ratio. We report only a comparison between
AdaBoost and GeoLev; GeoArc performed comparably to GeoLev. The results
are illustrated in Figure 4. This figure is a scatter plot of the generalization error
on each of the datasets. These results appear to indicate that the new algorithms
are comparable to AdaBoost.

Further experiments are clearly warranted and we are especially interested in 
situations where the weak learner produces hypotheses with small 2-norm. 

7 Conclusions and Directions for Further Study 

We have presented the GeoLev and GeoArc algorithms which attempt to form 
master hypotheses that are correct and equally confident over the sample. We 
found it convenient to view these algorithms as performing a feasible direction 
gradient descent constrained by the hypotheses produced by the weak learner. 
The potential function used by GeoLev is not monotonic: its gradient can have 
negative components. Therefore the direction of steepest descent cannot simply 
be normalized to create a distribution for the weak learner. 
Fig. 4. Generalization error of GeoLev versus AdaBoost after 25 rounds.



We described two ways to solve this problem. The first constructs a modified
sample by flipping some of the labels. This solution is mildly unsatisfying as it 
strengthens the requirements on the weak learner - the weak learner must now 
deal with a broader class of possible targets. Therefore we also presented a second 
transformation that does not increase the requirements on the weak learner. In 
fact, using this second transformation can actually improve the efficiency of the 
leveraging algorithm. One open issue is whether or not this improvement can be 
exploited to improve GeoArc’s performance bounds. A second open issue is to 
determine the effectiveness of these transformations when applied to other non- 
monotonic potential functions, such as those considered by Mason et al. mi- 

We have upper bounded the sample error rate of the master hypotheses produced 
by the GeoLev and GeoArc algorithms. These bounds are incomparable with the
analogous bounds for AdaBoost. The bounds indicate that GeoLev/GeoArc may 
perform slightly better at the start of the leveraging process and when the weak 
hypotheses contain many low-confidence predictions. On the other hand, the 
bounds indicate that GeoLev/GeoArc may not exploit later iterations as well, 
and may be less effective when the weak learner produces ±1 valued hypotheses. 
These disadvantages make it unlikely that the GeoArc algorithm has the boosting 
property. 

One possible explanation is that GeoLev/GeoArc aim at a cone inscribed in the 
positive orthant in margin space. As the sample size grows, the dimension of the 
space increases and the volume of the cone becomes a diminishing fraction of 
the positive orthant. AdaBoost’s potential function appears better at navigating
into the “corners” of the positive orthant. 

However, our preliminary tests indicate that after 25 iterations the generalization
errors of GeoArc/GeoLev are similar to AdaBoost’s on 13 classification datasets
from the UCI repository. These comparisons used 1-node decision tree classifiers
as the weak learning method. It would be interesting to compare their relative
performances when using a weak learner that produces hypotheses with many
low-confidence predictions.
Abstract. Recent works have shown the advantage of applying Active Learning
methods, such as the Query by Committee (QBC) algorithm, to various
learning problems. This class of algorithms requires an oracle with
the ability to randomly select a consistent hypothesis according to some
predefined distribution. When trying to implement such an oracle for the
family of linear separators, various problems must be solved.
The major problem is time complexity: the straightforward Monte
Carlo method takes exponential time.

In this paper we address some of those problems and show how to convert
them to the problems of sampling from convex bodies or approximating
the volume of such bodies. We show that recent algorithms for approximating
the volume of convex bodies and approximately uniformly sampling
from convex bodies using random walks can be used to solve this
problem, and yield an efficient implementation of the QBC algorithm.
This solution suggests a connection between random walks and certain
properties known in machine learning, such as $\epsilon$-nets and support vector
machines. Working out this connection is left for future work.



1 Introduction 

In the Active Learning paradigm the learner is given access to a stream of
unlabeled samples, drawn at random from a fixed and unknown distribution, and
for every sample the learner decides whether to query the teacher for the label.
Complexity in this context is measured by the number of requests directed to
the teacher during the learning process. The reasoning comes from many real-life
problems where the teacher’s activity is an expensive resource. For example, if
one would like to design a program that classifies articles into two categories
(“interesting” and “non-interesting”), then the program may automatically scan
as many articles as possible (e.g. through the Internet). However, any article for
which the program needs the teacher’s comment (tag) must actually be read by
the teacher, and that is a costly task. The Query By Committee (QBC) algorithm is an
Active Learning algorithm acting in the Bayesian model of concept learning

* Partially supported by project 1403-001.06/95 of the German-Israeli Foundation for 
Scientific Research [GIF]. 

P. Fischer and H.U. Simon (Eds.): EuroCOLT’99, LNAI 1572, pp. 34 ff., 1999.
i.e. it assumes that the concept to be learned is chosen according to some
fixed distribution known to the learning algorithm. The algorithm uses three
oracles: the Sample oracle returns a random sample $x$, the Label oracle returns
the label (tag) for a sample, and the Gibbs oracle returns a random hypothesis
from the version space. The algorithm gets two parameters, accuracy ($\alpha$) and
reliability ($\beta$), and works as follows:

1. Call Sample to get a random sample $x$.

2. Call Gibbs twice to obtain two hypotheses and generate two predictions for
the label of $x$.

3. If the predictions are not equal,
then call Label to get the correct label for $x$.

4. If Label was not used for the last $t_k$ consecutive samples, where $k$ is the
current number of labeled samples,
then call Gibbs once and output this last hypothesis;
else return to the beginning of the loop (step 1).
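The four steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the three oracles and the stopping threshold `t_k` are caller-supplied callables, and the toy threshold-concept oracles in `make_threshold_oracles` are hypothetical helpers introduced only to make the sketch runnable.

```python
import random

def query_by_committee(sample, label, gibbs, t_k, max_iters=100_000):
    """Sketch of the QBC loop; every argument is a caller-supplied stand-in.

    sample() -> x                : Sample oracle
    label(x) -> +1 or -1         : Label oracle
    gibbs(labeled) -> hypothesis : Gibbs oracle; returns a callable h(x)
                                   drawn from the current version space
    t_k(k) -> int                : stopping threshold after k labels
    """
    labeled = []        # (x, tag) pairs queried so far
    unlabeled_run = 0   # consecutive samples for which Label was not used
    for _ in range(max_iters):
        x = sample()                                # step 1
        h1, h2 = gibbs(labeled), gibbs(labeled)     # step 2
        if h1(x) != h2(x):                          # step 3
            labeled.append((x, label(x)))
            unlabeled_run = 0
        else:
            unlabeled_run += 1
        if unlabeled_run >= t_k(len(labeled)):      # step 4
            return gibbs(labeled), labeled
    raise RuntimeError("stopping condition not reached within max_iters")

def make_threshold_oracles(theta, rng):
    """Hypothetical toy class: c(x) = +1 iff x >= theta, with x in [0, 1]."""
    def sample():
        return rng.random()
    def label(x):
        return 1 if x >= theta else -1
    def gibbs(labeled):
        # The version space is the interval of thresholds consistent with the tags.
        lo = max((x for x, t in labeled if t == -1), default=0.0)
        hi = min((x for x, t in labeled if t == 1), default=1.0)
        th = rng.uniform(lo, hi)
        return lambda x, th=th: 1 if x >= th else -1
    return sample, label, gibbs
```

In this toy class the Gibbs oracle is trivial because the version space is an interval; for linear separators it is exactly the component that the rest of the paper works to simulate.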

A natural means of tracing the progress of the learning process is the rate at
which the size of the version space decreases. We adopt the notion of information
gain as the measure of choice for the analysis of the learning process:

Definition 1 (Haussler et al.). Let $V_x = \{h \in V \mid h(x) = c(x)\}$ be the
version space after sample $x$ has been labeled.

- The instantaneous information gain is $I(x, c(x)) = -\log \Pr_{h \in V}[h \in V_x]$.

- The expected information gain of a sample $x$ is

$$G(x|V) = H(\Pr_{h \in V}[h(x) = 1]) = \Pr_{h \in V}[h(x) = 1] \cdot I(x, 1) + \Pr_{h \in V}[h(x) = -1] \cdot I(x, -1) \qquad (1)$$

where $H$ is the binary entropy, i.e. $H(p) = -p \log p - (1 - p)\log(1 - p)$.
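As a small numerical check of Definition 1, the sketch below computes the expected information gain both as a binary entropy and as the probability-weighted sum of the two instantaneous gains, and the two agree. Logarithms are taken base 2 so that the gain is measured in bits; the function names are illustrative only.

```python
import math

def binary_entropy(p):
    """H(p) = -p log p - (1 - p) log(1 - p), in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def instantaneous_gain(p_consistent):
    """Gain when the surviving fraction of the version space is p_consistent."""
    return -math.log2(p_consistent)

def expected_gain(p_plus):
    """Pr[h(x)=1] * I(x, 1) + Pr[h(x)=-1] * I(x, -1)."""
    total = 0.0
    for p in (p_plus, 1.0 - p_plus):
        if p > 0.0:   # a label of probability zero contributes nothing
            total += p * instantaneous_gain(p)
    return total
```

A sample that splits the version space evenly (`p_plus = 0.5`) yields one full bit, the maximum possible for a binary label.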



Theorem 1 (Freund et al.). If a concept class $C$ has VC-dimension $0 < d < \infty$
and the expected information gain from the queries to the Label oracle made
by QBC is uniformly lower bounded by $g > 0$ bits, then the following holds with
probability larger than $1 - \beta$ over the choice of the target concept, the sequence
of samples, and the choices made by QBC:

1. The number of calls to Sample is $m_0$.

2. The number of calls to Label is smaller than $k_0$.

3. The algorithm guarantees that 

$$\Pr_{c,h,\mathrm{QBC}}\left[\Pr_x[h(x) \ne c(x)] > \alpha\right] < \beta \qquad (2)$$

^ tfc = correction of the expression given at (jSl)- 

Since $k_0$ grows only logarithmically, there is an exponential gap between the
number of queries QBC makes to Label and the number made by “regular” algorithms.
The main theme governing the proof of this theorem is the capability to bound
the number of queries made by QBC in terms of $g$, the lower bound for the
expected information gain: if the algorithm asks to tag all $m$ samples, then
the accumulated information gain grows only logarithmically with $m$. Obviously,
when filtering out samples the accumulated information gain cannot be larger.
On the other hand, $kg$ is a lower bound on the accumulated expected information
gain from $k$ tagged samples. These two observations suggest that $kg$ is at most
$(d + 1)$ times a term logarithmic in $m$, which results
in a bound on $k$ and implies that the gap between consecutive queried samples
is expected to grow until the stop-condition is satisfied. Theorem 1 can
be augmented to handle a general class of filtering algorithms: let $L$ be an
algorithm that filters out samples based on an internal probability assignment and
previous query results. Using a stop-condition identical to the one used by the
QBC algorithm and following the basic steps of the proof of Theorem 1, one may
conclude similar bounds on the number of calls $L$ makes to the Sample and Label
oracles.

By stating a lower bound on the expected information gain, Freund et al. were
able to identify several classes of concepts as learnable by the QBC algorithm.
Among them are classes of perceptrons (linear separators) defined by a vector
$w$ such that for any sample $x$:

$$c_w(x) = \begin{cases} +1 & \text{if } \langle x, w \rangle > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (3)$$

with the restriction that the version space distribution is known to the learner
and both sample space and version space distributions are almost uniform. A
question which was left open is how to efficiently implement the Gibbs oracle and
thus reduce QBC to a standard learning model (using only the Sample and Label
oracles). It turns out that this question falls naturally into a class of
approximation problems which have received much attention in recent years: how to
efficiently approximate the volume of, or sample at random from, spaces that are
defined in a rather complex way and that change dynamically. Moreover, unraveling
the meaning of the random walks employed by these approximate counting methods
seems to have interesting implications in learning theory.

Let us focus on the problem of randomly selecting hypotheses from the version
space and limit our discussion to classes of linear separators. There are several
known algorithms for finding a linear separator (e.g. the perceptron algorithm),
but none of them suffices, since we need to randomly select a separator in the
version space. A possible straightforward solution is the use of the Monte Carlo
mechanism: assuming (as we do later) that the linear separators are uniformly
distributed, we randomly select a point on the unit sphere (pick $n$ normally
distributed variables and normalize them by the square root of the sum of their
squares), identify this point as a linear separator, and check whether it is
in the version space. If not, we proceed with the sampling until a consistent
separator is selected. This process raises several problems, the most important of
which is efficiency. Recall that the QBC algorithm assumes a lower bound $g > 0$
for the expected information gain from queried samples. Let $p < 1/2$ be such
that $H(p) = g$. Having $k$ tagged samples, the probability of selecting a consistent
separator is smaller than $(1 - p)^k$. This implies that the expected number of
iterations the Monte Carlo algorithm makes until it finds a desired separator is
greater than $(1 - p)^{-k}$. If the total number of samples the algorithm uses is $m$,
and $k$ is the number of tagged samples, then the computational complexity is at
least $\Omega(m(1 - p)^{-k})$. Plugging in the expected value for $k$ in the QBC algorithm,
the Monte Carlo implementation results in a computational
complexity exponential in $g$ and $d$ (the VC-dimension, i.e. $n + 1$ in our case),
which also depends polynomially on $m$. Furthermore, if the version space decreases
faster than its expected value, then finding a consistent linear separator will take
even longer. The algorithms we suggest in this paper work in time polynomial in $n$
and $g$, i.e., they are exponentially better in terms of the VC-dimension
and $g$, and also have better polynomial factors in terms of $m$. We also
avoid the problem of a rapid decrease in the size of the version space by employing
a detecting condition.
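The straightforward Monte Carlo mechanism described above can be sketched as plain rejection sampling. This is an illustration under assumptions: the version space is represented by a list of rows `A` with the convention that a vector `v` is consistent when every row has a positive inner product with `v`, and the helper names are hypothetical. The returned try count makes the cost visible, since it grows as the consistent region shrinks.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Normalize a vector of independent normal variables (uniform on the sphere)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in z))
    return [c / norm for c in z]

def in_version_space(A, v):
    """True when every constraint row has a positive inner product with v."""
    return all(sum(a * c for a, c in zip(row, v)) > 0 for row in A)

def rejection_sample(A, rng, max_tries=1_000_000):
    """Draw unit vectors until one is consistent; return it and the try count."""
    dim = len(A[0])
    for tries in range(1, max_tries + 1):
        v = random_unit_vector(dim, rng)
        if in_version_space(A, v):
            return v, tries
    raise RuntimeError("no consistent separator found within max_tries")
```

With three orthogonal constraints in three dimensions, one draw in eight is accepted on average; each additional informative constraint cuts the acceptance rate further, which is the exponential behaviour the text refers to.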

1.1 Mathematical Notation 

The sample space $\Omega$ is assumed to be a subset of $\mathbb{R}^n$ and therefore a sample is
a vector in $\mathbb{R}^n$. A linear separator is a tuple $\{v, \mathrm{offset}\}$ where $v$ is a vector in $\mathbb{R}^n$
and $\mathrm{offset} \in \mathbb{R}$. To simplify notation we shall assume that each sample $x$ is taken
from $\mathbb{R}^{n+1}$, forcing $x_1 = 1$. Hence a linear separator is just a vector in $\mathbb{R}^{n+1}$. The
concept to be learned is assumed to be chosen according to a fixed distribution $\mathcal{D}$
over $\mathbb{R}^{n+1}$. We denote by $x^i$ a queried sample and by $t^i$ the corresponding tag, where
$t^i \in \{-1, 1\}$. The version space $V$ is defined to be $V = \{v \mid \forall i\; \langle x^i, v \rangle \cdot t^i > 0\}$.
Let $Y^i = t^i \cdot x^i$. Then a vector $v$ is a linear separator if $\forall i\; \langle Y^i, v \rangle > 0$. Using
matrix notation we may further simplify notation by setting $Y^i$ to be the $i$'th
row of a matrix $A$ and writing $V = \{v \mid Av > 0\}$.
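The matrix form of the version space can be illustrated with a short sketch. It assumes, for illustration only, that the constant coordinate is placed first in the augmented sample, so a separator is `[offset, weights...]`; the function names are hypothetical.

```python
def build_constraint_matrix(samples, tags):
    """Return the rows t^i * x^i, with each x^i augmented by a leading 1."""
    A = []
    for x, t in zip(samples, tags):
        if t not in (-1, 1):
            raise ValueError("tags must be +1 or -1")
        augmented = [1.0] + list(x)          # assumed position of the constant coordinate
        A.append([t * c for c in augmented])
    return A

def is_consistent(A, v):
    """True when A v > 0 holds row by row, i.e. v lies in the version space."""
    return all(sum(a * c for a, c in zip(row, v)) > 0 for row in A)
```

For example, with one-dimensional samples `2.0` (tag `+1`) and `-1.0` (tag `-1`), the separator `[0.0, 1.0]` (zero offset, positive weight) satisfies both rows, while `[0.0, -1.0]` does not.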

1.2 Preliminary Observations 

Upon receiving a new sample $x$, the algorithm needs to decide whether to query
for a tag. The probability of labeling $x$ with $+1$ is $P^+ = \Pr_{\mathcal{D}}[v \in V^+]$, where $\mathcal{D}$
is the distribution induced on the version space $V$ and $V^+ = \{v \mid Av > 0, \langle x, v \rangle > 0\}$.
Similarly we define $V^-$ and $P^-$, which correspond to labeling $x$ with $-1$. The
QBC algorithm decides to query for a tag only when the two hypotheses disagree
on the label of $x$, and this happens with probability $2P^+P^-$. Thus $P^+$ and $P^-$ are all
we need in order to substitute the Gibbs oracle and make this decision. Normalizing
$\|v\| = 1$, the version space of linear separators becomes a subset of the $n$-dimensional
sphere $S^n$. Under the uniform distribution on $S^n$, the value of $P^+$ (and $P^-$) can
be obtained by calculating $(n + 1)$-dimensional volumes: $P^+ = \mathrm{Vol}(V^+)/\mathrm{Vol}(V)$. Now $V$,
$V^+$ and $V^-$ are convex simplexes in the $(n + 1)$-dimensional unit ball. (The term
simplex is used here to describe a conical convex set of the type
$K = \{v \in \mathbb{R}^n \mid Av > 0, \|v\| \le 1\}$. Note that this is a nonstandard use of the term.)
Having $P^+$ and $P^-$, we can substitute the Gibbs oracle: given a set of tagged samples
and a new sample $x$, query the label of $x$ with probability $2P^+P^-$.
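The query rule can be checked numerically. The sketch below computes the probability of the +1 label from two volumes, derives the query probability, and compares it with a direct simulation of two independent committee members; the function names are illustrative and the volumes are simply given as numbers.

```python
import random

def p_plus_from_volumes(vol_plus, vol_minus):
    """Probability of the +1 label: Vol(V+) / (Vol(V+) + Vol(V-))."""
    return vol_plus / (vol_plus + vol_minus)

def query_probability(p_plus):
    """Probability that two independent hypotheses disagree on the label."""
    return 2.0 * p_plus * (1.0 - p_plus)

def simulate_disagreement(p_plus, trials, rng):
    """Empirical disagreement rate of two independently drawn predictions."""
    disagreements = 0
    for _ in range(trials):
        first = rng.random() < p_plus
        second = rng.random() < p_plus
        disagreements += (first != second)
    return disagreements / trials
```

The rule therefore needs only the single number P+, which is why estimating volume ratios is enough to replace the Gibbs oracle in the querying decision.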



1.3 Few Results about Convex Bodies 



In order to simulate the Gibbs oracle we seek efficient methods for calculating
the volume of a convex body and uniformly sampling from it. Very similar questions
relating to convex bodies have been addressed by Dyer et al., Lovász and
Simonovits, and others.



Theorem 2 (The Sampling Algorithm). Let $K$ be a convex body such
that $K$ contains at least 2/3 of the volume of the unit ball $B$ and at least
2/3 of the volume of $K$ is contained in a ball of radius $m$, where $m > 1$ is
suitably bounded. For arbitrary $\epsilon > 0$ there exists a sampling algorithm that uses
$O(n^4 m^2 \log^2(1/\epsilon)(n \log n + \log(1/\epsilon)))$ operations on numbers of size $O(\log n)$ bits
and returns a vector $v$ such that for every Lebesgue measurable set $L$ in $K$:

$$\left| \Pr(v \in L) - \frac{\mathrm{Vol}(L)}{\mathrm{Vol}(K)} \right| < \epsilon$$



The algorithm uses random walks over the convex body $K$ as its method of
traversal, and it reaches every point of $K$ with almost equal probability. To complete
this result, Grötschel et al. used the ellipsoid method to find an affine
transformation $T_a$ such that, given a convex body $K$, the image $T_a(K)$ contains the
unit ball $B$ and is itself contained in a ball of bounded radius. Assume
that the original body $K$ is bounded in a ball of radius $R$ and contains a ball
of radius $r$; then the algorithm finds $T_a$ using a number of operations polynomial
in $n$ and $|\log r| + |\log R|$, on numbers whose size in bits is also polynomial in
these quantities. Before we proceed, let us elaborate
on the meaning of these results for our needs: when applying the sampling
algorithm to the convex body $V$, we will get $v \in V$. Since $V^+$ and $V^-$ are both
simplexes, they are Lebesgue measurable, hence $|\Pr(v \in V^+) - P^+| < \epsilon$.
Note that we are only interested in the proportions of $V^+$ and $V^-$, and the use of
the affine transformation $T_a$ preserves these proportions. The sampling algorithm
enabled Lovász and Simonovits to come up with an algorithm for approximating
the volume of a convex body:



Theorem 3 (Volume Approximation Algorithm). Let $K$ be a convex body
in $\mathbb{R}^n$, assumed to be given using a separation oracle. There exists a volume
approximating algorithm such that upon
receiving error parameters $\epsilon, \delta \in (0, 1)$ and numbers $R$ and $r$ such that $K$ contains
a ball of radius $r$ and is contained in a ball of radius $R$ centered at the origin,
the algorithm outputs a number $\zeta$ such that with probability at least $1 - \delta$

$$(1 - \epsilon)\mathrm{Vol}(K) < \zeta < (1 + \epsilon)\mathrm{Vol}(K) \qquad (4)$$

The algorithm works in time polynomial in $|\log R| + |\log r|$, $n$, $1/\epsilon$ and $\log(1/\delta)$.

For our purposes, this algorithm can estimate the expected information gained
from the next sample. It also approximates the value of $P^+$ (and $P^-$), and thus
we may simulate the Gibbs oracle by choosing to query for a tag with probability
$2P^+(1 - P^+)$. Both the sampling algorithm and the volume approximation
algorithm require the values of $R$ and $r$. Since in our case all convex bodies are
contained in the unit ball $B$, fixing $R = 1$ will suffice and we are left with
the problem of finding $r$. However, it will suffice to find $r$ such that $r^*/4 \le r \le r^*$,
where $r^*$ is the maximal radius of a ball contained in $K$. Moreover, we will have
to show that $r$ is not too small. The main part of this proof is to show that if
the volume of $V^+$ is not too small, then $r$ is not too small (Lemma 1). Since we
learn by reducing the volume of the version space, this lemma states that the
radius decreases at a rate proportional to the learning rate.



2 Modified Query By Committee Algorithms 

In this section we present two variants of the QBC algorithm: the first uses
approximation of the volume of convex bodies, while the second uses the technique
for sampling from convex bodies. Both algorithms are efficient and maintain
the exponential gap between labeled and unlabeled samples. QBC” is especially
interesting from a computational complexity perspective, while the mechanism
at the basis of QBC’ enables the approximation of the maximal a-posteriori
(MAP) hypothesis in Poly(log $m$) time as well as direct access to the expected
information gain from a query.



2.1 Using Volume Approximation in the QBC Algorithm (QBC’) 

Every sample $x$ induces a partition of the version space $V$ into two subsets
$V^+$ and $V^-$. Since $V = V^+ \cup V^-$ and they are disjoint, $\mathrm{Vol}(V) =
\mathrm{Vol}(V^+) + \mathrm{Vol}(V^-)$ and $P^+ = \mathrm{Vol}(V^+) / (\mathrm{Vol}(V^+) + \mathrm{Vol}(V^-))$. Hence, approximating these
volumes results in an approximation of $P^+$ that we can use instead of the original
value. In order to use the volume approximation algorithm as a procedure, we
need to bound the volumes of $V^+$ and $V^-$, i.e. find balls of radii $r^+$ and $r^-$
contained in $V^+$ and $V^-$ respectively. If both volumes are not too small, then
the corresponding radii are big enough and may be calculated efficiently using
convex programming. If one of the volumes is too small, then we are guaranteed
that the other one is not too small, since $\mathrm{Vol}(V)$ is not too small (by a lemma). It
turns out that if one of the radii is very small, then assuming that the
corresponding part is empty (i.e. the complementary part is the full version space) is
enough for simulating the Gibbs oracle (by a lemma). The QBC’ algorithm follows:



Given $\alpha, \beta > 0$, let $k$ be the current number of labeled samples, and
define the parameters $t_k$, $\epsilon_k$, $\delta_k$ and $\mu_k$ (each a function of $k$ and of the inputs).

1. Call Sample to get a random sample $x$.

2. Use convex programming to simultaneously calculate the values of $r^+$ and
$r^-$, the radii of balls contained in $V^+$ and $V^-$.

3. If $\min(r^+, r^-) < \mu_k (\max(r^+, r^-))^n$,
then assume that the corresponding body is empty and go to step 6.

4. Call the volume approximation algorithm with $r^+$ (and $r^-$) to get $\zeta^+$ (and
$\zeta^-$) such that

$$(1 - \epsilon_k)\mathrm{Vol}(V^+) < \zeta^+ < (1 + \epsilon_k)\mathrm{Vol}(V^+)$$

with probability greater than $1 - \delta_k$.

5. Let $P^+ = \zeta^+ / (\zeta^+ + \zeta^-)$; with probability $2P^+(1 - P^+)$ call Label to get the correct
label of $x$.

6. If Label was not used for the last $t_k$ consecutive samples,
then call the sampling algorithm with $\epsilon = \beta/2$ to give a hypothesis and
stop. Else return to the beginning of the loop (step 1).

With probability greater than $1 - \beta$, the time complexity of QBC’ is $m$ times
a polynomial in $n$, $k$ and further parameters (for each iteration). The number of
samples that the algorithm
uses is polynomial in the number of samples that the QBC algorithm uses.
Furthermore, the exponential gap between the number of samples and the number
of labeled samples still holds.
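The following short bound is not taken from the paper; it is a worked step, under stated assumptions, indicating why multiplicative volume estimates are enough for step 5 of QBC’. Write $\zeta^+$ and $\zeta^-$ for the two volume estimates (symbol names assumed here) and suppose both satisfy the guarantee of Theorem 3 with the same $\epsilon_k \in (0, 1)$. Bounding the numerator and the denominator of the estimated probability separately gives:

```latex
\frac{1-\epsilon_k}{1+\epsilon_k}\,P^+
  \;\le\; \hat{P}^+ = \frac{\zeta^+}{\zeta^+ + \zeta^-}
  \;\le\; \frac{1+\epsilon_k}{1-\epsilon_k}\,P^+ ,
\qquad\text{where } P^+ = \frac{\mathrm{Vol}(V^+)}{\mathrm{Vol}(V^+) + \mathrm{Vol}(V^-)} .
```

The relative error of $\hat{P}^+$ is therefore at most $2\epsilon_k/(1-\epsilon_k)$, so a small $\epsilon_k$ keeps the query probability $2\hat{P}^+(1-\hat{P}^+)$ close to the one the Gibbs oracle would induce.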

2.2 Using Sampling in the QBC algorithm (QBC”) 

Another variant of QBC is the QBC” algorithm, which simulates the Gibbs oracle
by sampling, almost uniformly, two hypotheses from the version space:

Given $\alpha, \beta > 0$, let $k$ be the current number of labeled samples, and
define the parameters $\epsilon_k$, $t_k$, $\delta_k$ and $\mu_k$ (each a function of $k$ and of the inputs).

1. Call Sample to get a random sample $x$.

2. Call the sampling algorithm with $\epsilon_k$ and $r$ to get two hypotheses from the
version space.

3. If the two hypotheses disagree on the label of $x$,
then use convex programming to simultaneously calculate the values of $r^+$
and $r^-$, the radii of the balls contained in $V^+$ and $V^-$.

4. If $\min(r^+, r^-) > \mu_k (\max(r^+, r^-))^n$,
then call Label to get the correct label and choose $r^+$ (or $r^-$), the radius
of a ball contained in the new version space, to be used in step 2.

5. If Label was not used for the last $t_k$ consecutive samples,
then call the sampling algorithm with $\epsilon = \beta/2$ to give a hypothesis and
stop.
Else return to the beginning of the loop (step 1).

QBC” is very similar to QBC’ but with one major difference: a new radius is
calculated almost only when the version space changes, and this happens
only after querying for a tag. Hence each iteration takes the number of operations
stated in Theorem 2 for the sampling algorithm, which is polynomial in $n$ and
$\log(1/\epsilon)$, and an extra Poly($n$, $1/\epsilon$, $1/\delta$, $k$) is needed when a tag is
asked. Due to the exponential gap between the number of samples and the number
of tagged samples, this algorithm may be attractive for practical use.
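To give a feel for sampling from the version space by a random walk, here is a minimal hit-and-run sketch over the body of vectors `v` with `A v >= 0` and norm at most 1. Hit-and-run is swapped in purely for illustration: it is not the walk analyzed in Theorem 2, this sketch carries no mixing-time guarantee, and the interior starting point is assumed to be supplied by the caller.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_direction(dim, rng):
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(z, z))
    return [c / norm for c in z]

def chord(A, p, d):
    """Interval [lo, hi] of t keeping p + t*d inside {v : A v >= 0, |v| <= 1}."""
    b = dot(p, d)
    c = dot(p, p) - 1.0
    root = math.sqrt(max(b * b - c, 0.0))
    lo, hi = -b - root, -b + root           # limits imposed by the unit ball
    for row in A:
        slack, rate = dot(row, p), dot(row, d)
        if rate > 0:
            lo = max(lo, -slack / rate)     # row.(p + t d) >= 0 needs t >= -slack/rate
        elif rate < 0:
            hi = min(hi, -slack / rate)     # row.(p + t d) >= 0 needs t <= -slack/rate
    return lo, hi

def hit_and_run(A, start, steps, rng):
    """Walk from an interior point; each step moves uniformly along a random chord."""
    p = list(start)
    for _ in range(steps):
        d = random_direction(len(p), rng)
        lo, hi = chord(A, p, d)
        if lo < hi:                         # skip degenerate chords
            t = rng.uniform(lo, hi)
            p = [pi + t * di for pi, di in zip(p, d)]
    return p
```

The point returned lies in the body rather than on the sphere; dividing it by its norm yields a unit-length separator, since only the direction matters for classification.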
3 Deriving the Main Results

We start the analysis by presenting a useful lemma which bounds the radius of
a ball contained in a simplex as a function of its volume. The algorithms for
estimating the volume of a convex body, or sampling from it, work in time
polynomial in $|\log r|$ ($R = 1$ in our case). The following lemma gives a lower
bound on the size of $r$, and hence will be useful for analyzing the time complexity of
our algorithms.

Lemma 1. Let $K$ be a convex body contained in the $n$-dimensional unit ball $B$
(assume $n > 1$). Let $v = \mathrm{Vol}(K)$; then there is a ball of radius $r$ contained in $K$
such that

$$r \ge \frac{1}{n}\,\frac{\mathrm{Vol}(K)}{\mathrm{Vol}(B)} = \frac{v\,\Gamma\!\left(\frac{n+2}{2}\right)}{n\,\pi^{n/2}} \qquad (5)$$
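The bound of Lemma 1, read as $r \ge v\,\Gamma((n+2)/2) / (n\,\pi^{n/2})$, is easy to evaluate numerically. The sketch below uses illustrative function names and checks the extreme case in which the body is the unit ball itself, where the bound evaluates to $1/n$.

```python
import math

def unit_ball_volume(n):
    """Volume of the n-dimensional unit ball: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def inscribed_radius_lower_bound(volume, n):
    """Lower bound on the inscribed radius of a convex body of the given
    volume contained in the n-dimensional unit ball."""
    return volume * math.gamma((n + 2) / 2) / (n * math.pi ** (n / 2))
```

Since the true inscribed radius of the unit ball is 1, the lemma can be loose by a factor of $n$; this does not matter for the analysis, which only needs the logarithm of the radius.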

Proof. We shall assume that $K$ is given by a finite set of linear inequalities with
the constraint that all the points in $K$ are within the unit ball. Let $r^*$ be the
supremum of the radii of balls contained in $K$, and let $r > r^*$. We denote by $\partial K$
the boundary of $K$. Then $\partial K$ has a derivative at all points apart from a set of
zero measure.

We construct a set $S$ by taking a segment of length $r$ for each point $y$ of $\partial K$ such
that $\partial K$ has a derivative at $y$. We take the segment to be in the direction which is
orthogonal to the derivative at $y$ and pointing towards the inner part of $K$ (see
Figure 1).




Fig. 1. An example of the process of generating 
the set S (the gray area) by taking orthogonal 
segments to dK (the bold line). In this case, the 
length r (the arrow) is too small. 



The assumption that $r > r^*$ implies $K \subseteq S$ up to a set of zero measure. To show
this we look at the following: let $x \in K$; then there is $\hat{r}$ which is the maximal
radius of a ball contained in $K$ centered at $x$. If we look at the ball of radius $\hat{r}$,
then it intersects $\partial K$ at least at one point. At this point, denoted $y$, the derivative
of $\partial K$ is the derivative of the boundary of the ball (the derivative exists). Hence
the orthogonal segment to $y$ of length $\hat{r}$ reaches $x$. Since $\hat{r} \le r^* < r$, the
orthogonal segment of length $r$ from $y$ reaches $x$. The only points that might
not be in $S$ are the points in $\partial K$ where there is no derivative, but this is a set
of measure zero. Therefore

$$\mathrm{Vol}(S) \ge \mathrm{Vol}(K) = v \qquad (6)$$

But, by the definition of $S$ we can see that

$$\mathrm{Vol}(S) \le \int_{\partial K} r = r\,\mathrm{Vol}(\partial K) \qquad (7)$$

Note that $\mathrm{Vol}(\partial K)$ is an $(n - 1)$-dimensional volume. Hence, we conclude that
$v \le r\,\mathrm{Vol}(\partial K)$, or $r \ge v / \mathrm{Vol}(\partial K)$. This is true for every $r > r^*$, therefore

$$r^* \ge \frac{v}{\mathrm{Vol}(\partial K)} \qquad (8)$$

It remains to bound the size of $\mathrm{Vol}(\partial K)$. In appendix C we show that the
convex body with maximal surface is the unit ball itself (this result can be obtained
from the isoperimetric inequalities as well). Since the volume of the $n$-dimensional
ball of radius $r$ is $r^n \pi^{n/2} / \Gamma\!\left(\frac{n+2}{2}\right)$, the $(n - 1)$-dimensional volume of its surface is
$n r^{n-1} \pi^{n/2} / \Gamma\!\left(\frac{n+2}{2}\right)$. Substituting $\mathrm{Vol}(\partial K)$ with the volume of the surface of the
unit ball in equation (8), we conclude that

$$r^* \ge \frac{v\,\Gamma\!\left(\frac{n+2}{2}\right)}{n\,\pi^{n/2}}.$$

Remark 1. Perhaps the lemma is known, but we could not find a reference.
Eggleston (also quoted in M. Berger’s book Geometrie, 1990) gives a rather
difficult proof of a lower bound on $r$ in terms of width($K$) for $n$ odd (for even $n$ the
bound is somewhat modified). The two lower bounds seem unrelated. Moreover,
width($K$) seems hard to approximate by sampling.

Convex programming provides the tools for efficiently estimating the radius of
the ball contained in a simplex, as shown in the statement of the next lemma
(proved in the appendix):

Lemma 2. Let $K$ be a simplex contained in the unit ball, defined by $k$ inequalities.
Let $r^*$ be the maximal radius of a ball contained in $K$. Using the ellipsoid
algorithm, it is possible to calculate a value $r$ such that $r^* \ge r \ge r^*/4$, in time
which is Poly($n$, $|\log r^*|$, $k$).

We are now ready to present the main theorem for each variant of the QBC
algorithm. However, since the difference in the analysis of the two algorithms
provides no further insight, we chose to describe here only the proof for QBC”,
which seems more adequate for practical use. The interested reader may find
detailed proofs for both algorithms in the comprehensive version.

Theorem 4. Let $\alpha, \beta > 0$, let $n > 1$ and let $c$ be a linear separator chosen
uniformly from the space of linear separators. Then algorithm QBC” (as described
in Section 2.2) returns a hypothesis $h$ such that

$$\Pr_{c,h}\left[\Pr_x[c(x) \ne h(x)] > \alpha\right] < \beta \qquad (9)$$

using $m_0 = \mathrm{Poly}(n, 1/\alpha, 1/\beta)$ samples and $k_0$ labeled samples, where $k_0$ is $O(n)$
times a logarithmic factor.
With probability greater than $1 - \beta$, the time complexity is $m$ (the number of
iterations) times a polynomial in $n$, $k$ and logarithmic terms (the complexity of
each iteration).

Proof. Our proof consists of three parts: the correctness of the algorithm, its
sample complexity and its computational complexity. We base the proof on three
lemmas which are specified in Appendix A.

In one of these lemmas we show that if $q$ is the probability that QBC will ask to tag a
sample $x$ and $\hat{q}$ is the probability that QBC” will ask to tag the same sample,
then $|q - \hat{q}| < 4\epsilon_k$. Using this result, we show in a further lemma that if the algorithm
did not ask to tag more than $t_k$ consecutive samples and $V$ is the version space,
then

$$\Pr_{c, h \in V}\left[\Pr_x[c(x) \ne h(x)] > \alpha\right] < \beta/2 \qquad (10)$$

Therefore, if we stop at this point and pick a hypothesis approximately uniformly
from $V$ with picking accuracy $\beta/2$, we get an algorithmic error of no more than
$\beta$, and this completes the proof of the correctness of the algorithm.

We now turn to discuss the sample complexity of the algorithm. Following Freund
et al., we need to show that at any stage in the learning process there is a
uniform lower bound on the expected information gain from the next query.
Freund et al. showed that for the class of linear separators such a lower bound
exists when applying the original QBC algorithm under certain restrictions on
the distribution of the linear separators and the sample space. In Corollary 2 we
show that if $g$ is such a lower bound for the original QBC algorithm, then there
exists a lower bound $\hat{g}$ for the QBC” algorithm such that $\hat{g} \ge g - 4\epsilon_k$. Since
$\epsilon_k \le \alpha\beta$, we get a uniform lower bound $\hat{g}$ such that $\hat{g} \ge g - 4\alpha\beta$, so for $\alpha\beta$
sufficiently small $\hat{g} \ge g/2$. That is, the expected information QBC” gains from each
query it makes is “almost” the same as the one gained by QBC.

Having this lower bound on the expected information gain and using the
augmented Theorem 1, we conclude that the number of iterations QBC” will make
is polynomially bounded, while the number of queries it will make is bounded
by a polynomial in $n$ and logarithmic terms.

Finally we would like to discuss the computational complexity of the algorithm.
There are two main computational tasks in each iteration QBC” makes: first,
the algorithm decides whether to query on the given sample; second, it has to
compute the radii of balls contained in $V^+$ and $V^-$.

The first task is done by twice running the algorithm which approximates uniform
sampling from convex bodies, in order to obtain two hypotheses. The complexity
of this algorithm is polynomial in $|\log \epsilon_k|$, $n$ and $|\log r|$, where $\epsilon_k$ is the accuracy we
need, $n$ is the dimension and $r$ is the radius of a ball contained in the body. In
a lemma we bound the size of $r$ by showing that if the algorithm uses $m$ samples,
then with probability greater than $1 - \beta$, $r$ is bounded from below in every step.
Since we are interested in $\log r$, this bound suffices.

The other task is to calculate $r^+$ and $r^-$, which is done using convex programming
as shown in Lemma 2. It will suffice to show that at least one of them is
“big enough”, since we proceed in looking for the other value for no more than
a polynomial number of further iterations. From a lemma we know that
$\mathrm{Vol}(V)$ is bounded from below at every step. At least one of $V^+$ or $V^-$ has volume
at least $\mathrm{Vol}(V)/2$. Using the lower bound on the volume of $V$ and Lemma 1, we
conclude that the maximal radius $r$ is bounded from below as well.

We conclude that with probability greater than $1 - \beta$ the time complexity of each
iteration of the QBC” algorithm is polynomial in $n$, $k$ and logarithmic terms. In the
final iteration, the algorithm returns a hypothesis from the version space. Using the
sampling algorithm this is done in polynomial time. Combined with the fact that there
are at most $m_0$ iterations, the statement of the theorem follows. □



4 Conclusions and Further Research 

In this paper we presented a feasible way to simulate the Gibbs oracle by picking
almost uniformly distributed linear separators, and thus to reduce Query by
Committee to a standard learning model. To this purpose, we used convex
programming and formed a linkage to a class of approximation methods related to
convex body sampling and volume approximation. These methods use random
walks over convex bodies, on a grid which depends on the parameters $\epsilon$ and $\delta$, in
a similar fashion to PAC algorithms. It seems that such random walks could be
described using $\epsilon$-net terminology or the like. We thus suggest that this connection
has further implications in learning theory.

Freund et al. assumed the existence of the Gibbs oracle and essentially used
the information gain only for proving convergence of the QBC algorithm
(Theorem 1). The use of the volume approximation technique (i.e. QBC’) provides
direct access to the instantaneous information gain. This enables us to suggest
another class of algorithms, namely Information Gain Machines, which make
use of the extra information. Combining our results with the ones presented by
Shawe-Taylor and Williamson and by McAllester may make it possible to obtain
generalization error estimates for such information gain machines and make use of
the ability of QBC’ to produce a maximum a-posteriori estimate.

The two algorithms presented have to estimate the radius of a ball contained in a
convex body, which in our case is the version space $V$. Finding the center of a
large ball contained in the version space is also an essential task in the theory
of support vector machines (SVM). In this case the radius is the margin
of the separator. It is also clear that QBC is a filtering algorithm which seeks
samples that are going to be in the support set of the current version space. This
similarity implies a connection between the two paradigms and is the subject of
our current research.
A Lemmas for QBC”

Corollary 1. Let $K$ be a simplex in the unit ball such that the maximal radius
of the ball it contains is $r$. Then

$$\frac{r^n\,\pi^{n/2}}{\Gamma\!\left(\frac{n+2}{2}\right)} \le \mathrm{Vol}(K) \le \frac{r\,n\,\pi^{n/2}}{\Gamma\!\left(\frac{n+2}{2}\right)} \qquad (11)$$

Proof. The lower bound is obtained by taking the volume of the ball with radius
$r$ contained in $K$, and the upper bound is a direct consequence of Lemma 1. □
Proof (of Lemma 2). Let $A$ be the matrix representation of the $k$ inequalities
defining $K$. Given a point $x \in K$, we would like to find $r^*$, the maximal radius
of a ball centered at $x$ and contained in $K$. Any ball contained in $K$ and centered
at $x$ is also contained in the unit ball (which is centered at the origin). Hence
its radius $r$ must satisfy $\|x\| + r \le 1$. The ball with the maximal radius meets
the boundary of $K$, and at these points the boundary of $K$ is tangent to that
ball. If the boundary is defined by $A_i y = 0$, then the minimal distance between
this boundary and the point $x$ is given by $|\langle A_i, x \rangle| = |A_i x|$ (assuming $A_i$ is
normalized such that $\|A_i\|_2 = 1$). Since $x \in K$, $|A_i x| = A_i x$, which implies
that for any ball with radius $r$ centered at $x$ and contained in $K$, $\forall i\; A_i x \ge r$.
Else, the ball meets the spherical boundary, and thus $\|x\| + r \ge 1$. This discussion
suggests that finding $r^*$ may be expressed as an optimization problem:

$$r^* = \arg\max_r \{r \mid \exists x \text{ s.t. } \|x\| + r \le 1,\; Ax \ge r\} \qquad (12)$$

It is easy to see that this is a convex programming problem: fix $r$ and assume
that $x$ does not satisfy one of the conditions which define the optimization
problem. If there exists $i$ such that $\lambda = A_i x < r$, then the hyperplane defined
by $\{y \mid A_i y = \lambda\}$ is a separating hyperplane. Otherwise, if $\|x\| + r > 1$, then the
hyperplane defined by $\{y \mid \langle y - x, x \rangle = 0\}$ is a separating hyperplane (this is the
hyperplane orthogonal to the segment from the origin to $x$).

To conclude we need to show that the ellipsoid algorithm can do the job effi- 
ciently. First notice that r* is always bounded by 1. At lemma 0 we were able 
to show that r* is not worst then exponentially small. For our purposes, finding 
an r such that r > r*/4 will suffice. Note that if r > r*/4 then the volume of a 
ball with the radius r centered at x and contained in K is at least (^)"'Vol(H) 
where Vol(H) is the volume of the n dimensional unit ball. Hence, it is not worst 
than exponentially small in log(r*) and n. Since r* is not too small, efficiently 
of the ellipsoid algorithm is guaranteed. □ 
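As a small illustration of the two constraints derived in this proof, the sketch below computes the largest admissible radius for one fixed center $x$, namely the minimum of $1 - \|x\|$ and the distances $A_i x$. It is not the ellipsoid-based procedure of the proof, which also searches over $x$; the function name `max_ball_radius` and the example cone are assumptions made here for illustration.

```python
import math

def max_ball_radius(center, normals):
    """Largest r such that the ball around `center` with radius r stays inside
    K = {y : <a_i, y> >= 0 for all i} intersected with the unit ball.
    Each row of `normals` is assumed to be a unit vector a_i."""
    norm_x = math.sqrt(sum(c * c for c in center))
    # Distance from the center to each bounding hyperplane <a_i, y> = 0.
    plane_dists = [sum(a * c for a, c in zip(row, center)) for row in normals]
    # Distance from the center to the unit sphere.
    sphere_dist = 1.0 - norm_x
    return min([sphere_dist] + plane_dists)

# Hypothetical example: the positive quadrant of the plane.
normals = [(1.0, 0.0), (0.0, 1.0)]
r = max_ball_radius((0.3, 0.3), normals)
```

A negative return value indicates that the chosen center violates one of the constraints.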



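Lemma 3. Let $K_1, K_2$ be two convex simplexes and let $r_1, r_2$ be the maximal
radii of the corresponding contained balls. For $i = 1, 2$ let $v_i = \mathrm{Vol}(K_i)$. Define
$p = \frac{v_1}{v_1 + v_2}$. If $r_1 \le \mu r_2^n$ then $p \le n\mu$.

Proof. From Corollary 1 it follows that
$v_1 \le \frac{r_1 n \pi^{n/2}}{\Gamma\left(\frac{n+2}{2}\right)}$ and $v_2 \ge \frac{r_2^n \pi^{n/2}}{\Gamma\left(\frac{n+2}{2}\right)}$. Therefore

$$\frac{v_1}{v_2} \le \frac{n r_1}{r_2^n} \tag{13}$$

Let $\mu \le 1$. Substituting $r_1 \le \mu r_2^n$ we conclude that

$$p = \frac{v_1}{v_1 + v_2} \le \frac{v_1}{v_2} \le n\mu \tag{14}$$

□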
Lemma 4. For any sample $x$, let $q$ be the probability that the QBC algorithm
will query for a tag and let $\hat{q}$ be the similar probability for QBC''. Then

$$|q - \hat{q}| \le 4\epsilon_k \tag{15}$$

Proof. We start by analyzing the case that the radii size condition of QBC'' is
satisfied, i.e. $\min(r^+, r^-) \ge \mu_k (\max(r^+, r^-))^n$. Let $p = \Pr_{u \in V}[u(x) = 1]$; then
$q = 2p(1 - p)$.

The QBC'' algorithm samples two hypotheses, $h_1$ and $h_2$, from the version space
$V$. The sampling algorithm guarantees that

$$\left| \Pr_{h_1 \in V}[h_1 \in V^+] - \Pr[V^+] \right| \le \epsilon_k \tag{16}$$

and similarly for $V^-$ and $h_2$. Denote $a = \Pr_{h_1 \in V}[h_1 \in V^+]$ and
$b = \Pr_{h_2 \in V}[h_2 \in V^-]$. Since $h_1$ and $h_2$ are independent random variables,
$\hat{q} = 2ab$. Therefore, in order to bound $|q - \hat{q}|$ we need to maximize
$|2p(1 - p) - 2ab|$ subject to the restrictions $|p - a| \le \epsilon_k$ and $|(1 - p) - b| \le \epsilon_k$.
It is easy to check that the maximum is achieved when $|a - p| = \epsilon_k$ and
$|b - (1 - p)| = \epsilon_k$, and therefore

$$|2p(1 - p) - 2ab| \le 2\epsilon_k + 2\epsilon_k^2 \le 4\epsilon_k \tag{17}$$

We now consider the case where the radii ratio condition fails. Without loss
of generality, assume $r^+ < \mu_k (r^-)^n$. From Lemma 3 it follows that $p \le n\mu_k$,
and by the definition of $\mu_k$ we get $p \le 2\min(\epsilon_k, 1 - \epsilon_k)$, which means that
$q \le 4\min(\epsilon_k, 1 - \epsilon_k)(1 - 2\min(\epsilon_k, 1 - \epsilon_k)) \le 4\epsilon_k$. Therefore, by defining
$\hat{q} = 0$ we maintain the difference $|q - \hat{q}| \le 4\epsilon_k$. □
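The maximization behind (17) can be checked numerically. The sketch below is an illustration rather than part of the proof: it scans a grid of values of $p$ and the corners of the feasible box for $(a, b)$. Since $2ab$ is bilinear in $(a, b)$, the extremes over the box occur at its corners. The function name `worst_gap` and the grid resolution are assumptions, and the values of $a$ and $b$ are not clipped to $[0, 1]$, which can only enlarge the feasible set.

```python
def worst_gap(eps, steps=200):
    """Grid search for max |2p(1-p) - 2ab| subject to
    |p - a| <= eps and |(1 - p) - b| <= eps."""
    worst = 0.0
    for i in range(steps + 1):
        p = i / steps
        # Only the corners of the (a, b) box need to be examined.
        for da in (-eps, eps):
            for db in (-eps, eps):
                a, b = p + da, (1 - p) + db
                worst = max(worst, abs(2 * p * (1 - p) - 2 * a * b))
    return worst

gap = worst_gap(0.05)
```

For every tested $\epsilon_k \le 1$ the returned gap matches $2\epsilon_k + 2\epsilon_k^2$ and stays below $4\epsilon_k$.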



Lemma 5. Let $L$ be any sample filtering algorithm. For any sample $x$, let $q$
be the probability that the QBC algorithm will query for a tag and let $\hat{q}$ be the
corresponding probability for $L$, and assume $|q - \hat{q}| \le \gamma$. Let $g$ be a lower bound
on the expected information gain for QBC. Then there exists a lower bound on
the expected information gain for $L$, denoted by $\hat{g}$, such that

$$\hat{g} \ge g - \gamma \tag{18}$$

Proof. Let $r(x)$ be the density function over the sample space $X$. Let $p(x)$ be
the probability that $x$ is tagged 1, i.e. $p(x) = \Pr_{u \in V}[u(x) = 1]$; then
$q(x) = 2p(x)(1 - p(x))$. Since $g$ is a lower bound on the expected information gain
for the QBC algorithm,

$$g \le \int_X r(x)\, q(x)\, \mathcal{H}(p(x))\, dx \tag{19}$$

Since $|q - \hat{q}| \le \gamma$, the expected information gain for $L$ is bounded by

$$\int_X r(x)\, \hat{q}(x)\, \mathcal{H}(p(x))\, dx \ge \int_X r(x)\, q(x)\, \mathcal{H}(p(x))\, dx - \gamma \ge g - \gamma \tag{20}$$

Taking a close enough lower bound $\hat{g}$ for the leftmost term, the statement of the
lemma follows. □
Corollary 2. Let $g$ be a lower bound on the expected information gain of the
QBC algorithm. From Lemmas 4 and 5 it follows that there exists $\hat{g} \ge g - 4\epsilon_k$
such that $\hat{g}$ is a lower bound on the expected information gain of QBC''.



Lemma 6. Assume that after getting $k$ labeled samples, algorithm QBC'' does
not query for a tag in the next $t_k$ consecutive samples. If $c$ is a concept chosen
uniformly from the version space and $h$ is the hypothesis returned by QBC'', then

$$\Pr_{c,h}\left[\Pr_x[c(x) \ne h(x)] > \alpha\right] \le \beta/2 \tag{21}$$

Proof. We define a bad pair to be a pair of hypotheses from the version space that
differ on more than a proportion $\alpha$ of the samples. We want the algorithm to
stop only when the queries it made form an $(\alpha, \beta_k/4)$-net, i.e. if two hypotheses
are picked independently from the version space, the probability that they form
a bad pair is less than $\beta_k/4$. We will show that if the algorithm did not make a
query for $t_k$ consecutive samples, then the probability that the queries sampled
do not form an $(\alpha, \beta_k/4)$-net is bounded by $\beta_k/4$.

Let $W = \{(h_1, h_2) \mid \Pr_x[h_1(x) \ne h_2(x)] \ge \alpha\}$. If $\Pr[W] \le \beta_k/4$ then the
probability that $(c, h)$ is a bad pair is bounded by $\beta_k/4$ (when picked uniformly).
We would like to bound the probability that $\Pr[W] > \beta_k/4$ when QBC'' did not
query for a tag during the last $t_k$ consecutive samples.

If $\Pr[W] > \beta_k/4$, then the probability that the QBC algorithm will query for a
tag is greater than $\alpha\beta_k/4$. From Lemma 4 we conclude that the probability that
the QBC'' algorithm will query for a tag is greater than $\alpha\beta_k/4 - 4\epsilon_k$. Plugging
in $\epsilon_k = \alpha\beta_k/32$ we conclude that the probability that QBC'' will query for a tag
is greater than $\alpha\beta_k/8$. Therefore, the probability that it will not query for a
tag for $t_k$ consecutive samples is bounded by $(1 - \alpha\beta_k/8)^{t_k}$. Using the well known
relation $(1 - \epsilon)^n \le e^{-n\epsilon}$ and plugging in $t_k = \frac{8 \log(4/\beta_k)}{\alpha\beta_k}$ it follows that

$$\left(1 - \frac{\alpha\beta_k}{8}\right)^{t_k} \le e^{-t_k \alpha\beta_k/8} = e^{-\log(4/\beta_k)} = e^{\log(\beta_k/4)} = \frac{\beta_k}{4} \tag{22}$$

Hence, $\Pr_{x_1, x_2, \ldots}[\Pr[W] > \beta_k/4] \le \beta_k/4$. Thus if the algorithm stops after $t_k$
consecutive unlabeled samples, the probability of choosing a hypothesis which
forms a bad pair with the target concept is lower than $\beta_k/2$, since the probability
of $\Pr[W]$ being bigger than $\beta_k/4$ is less than $\beta_k/4$, and if it is smaller than $\beta_k/4$
then the probability of a mistake is bounded by $\beta_k/4$. By the definition of $\beta_k$ it
follows that $\sum_k \beta_k = \beta/2$ and we get the stated result. □
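The arithmetic in (22) can be reproduced directly. The sketch below is illustrative only: it takes the logarithm to be the natural logarithm (an assumption consistent with the use of $e$ in the bound), and the function names `waiting_time` and `no_query_bound` are introduced here, not taken from the paper.

```python
import math

def waiting_time(alpha, beta_k):
    """t_k = 8 * ln(4 / beta_k) / (alpha * beta_k), as used in (22)."""
    return 8.0 * math.log(4.0 / beta_k) / (alpha * beta_k)

def no_query_bound(alpha, beta_k):
    """(1 - alpha * beta_k / 8) ** t_k: the chance of t_k silent samples
    when each sample triggers a query with probability alpha * beta_k / 8."""
    t_k = waiting_time(alpha, beta_k)
    return (1.0 - alpha * beta_k / 8.0) ** t_k

bound = no_query_bound(alpha=0.1, beta_k=0.05)
```

For these example values the bound stays below $\beta_k/4 = 0.0125$, as (22) predicts.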



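Lemma 7. Let $a > 0$ and let $m$ be the number of calls to the Sample oracle
that QBC'' (or QBC') makes (assume $m \ge n$). Then the following holds with
probability greater than $1 - 2^{-a}$:

Each intermediate version space the algorithm generates has a volume greater
than $2^{-a} \left(\frac{d}{em}\right)^d \mathrm{Vol}(B)$, and there is a ball of radius greater than
$2^{-a} \left(\frac{d}{em}\right)^d \frac{\mathrm{Vol}(B)\, \Gamma\left(\frac{n+2}{2}\right)}{n \pi^{n/2}}$ contained in it, where $d$ is the VC-dimension of
the concept class and $B$ is the $n$-dimensional unit ball.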
Proof. The final version space, the one that is being used when the algorithm
stops, is the smallest of all intermediate version spaces. Moreover, if the algorithm
was not filtering out samples, but querying labels for all samples, then the final
version space would have been smaller. Since we are interested in the worst case,
we will assume that the algorithm did not filter out any sample.

Fix $X$, a vector of samples, and let $X^m$ be its first $m$ samples. Let $c$ be a concept
chosen from $\Omega$. $X^m$ divides $\Omega$, the set of all concepts, into equivalence sets: two
concepts are in the same set if they give the same label to all $m$ samples in
$X^m$. If $\Omega$ has VC-dimension $d$, we know from Sauer's lemma that the number
of equivalence sets is bounded by $\left(\frac{em}{d}\right)^d$. Using the distribution $\Pr_c$ over $\Omega$ it
follows that

$$\Pr_c\left[-\log \Pr[C] \ge a + d \log \frac{em}{d}\right] \le 2^{-a} \tag{23}$$

where $C$ is the equivalence set of $c$. We now turn to discuss the special case of
linear separators and the uniform distribution over the unit ball, i.e. $\Omega$. Note
that if $V$ is the version space after getting the labels for $X^m$, then $V = C$.
Therefore $\Pr[V] = \frac{\mathrm{Vol}(V)}{\mathrm{Vol}(B)}$, where $\mathrm{Vol}(B)$ is the volume of the $n$-dimensional unit
ball. Using (23) we get

$$\Pr_c\left[-\log \mathrm{Vol}(V) \ge a + d \log \frac{em}{d} - \log \mathrm{Vol}(B)\right] \le 2^{-a} \tag{24}$$

Assume that $-\log \mathrm{Vol}(V) \le a + d \log \frac{em}{d} - \log \mathrm{Vol}(B)$ (this is true with probability
greater than $1 - 2^{-a}$). From the lemma relating the volume of a convex body to
the radius of a ball it contains, we know that there is a ball in $V$ with radius
$r$ such that $r \ge \frac{\mathrm{Vol}(V)\, \Gamma\left(\frac{n+2}{2}\right)}{n \pi^{n/2}}$. We conclude that there is a ball in $V$ with radius
$r$ such that

$$r \ge 2^{-a} \left(\frac{d}{em}\right)^d \frac{\mathrm{Vol}(B)\, \Gamma\left(\frac{n+2}{2}\right)}{n \pi^{n/2}} \tag{25}$$

□



Hardness Results for Neural Network 
Approximation Problems 



Peter Bartlett¹ and Shai Ben-David²

¹ Department of Systems Engineering
Australian National University
Canberra ACT 0200, Australia
Peter.Bartlett@anu.edu.au
² Department of Computer Science
Technion
Haifa 32000, Israel
shai@cs.technion.ac.il



Abstract. We consider the problem of efficiently learning in two-layer 
neural networks. We show that it is NP-hard to find a linear thresh- 
old network of a fixed size that approximately minimizes the proportion 
of misclassified examples in a training set, even if there is a network 
that correctly classifies all of the training examples. In particular, for a 
training set that is correctly classified by some two-layer linear thresh- 
old network with k hidden units, it is NP-hard to find such a network 
that makes mistakes on a proportion smaller than $c/k^3$ of the examples,
for some constant c. We prove a similar result for the problem of ap- 
proximately minimizing the quadratic loss of a two-layer network with a 
sigmoid output unit. 



1 Introduction 

Previous negative results for learning two-layer neural network classifiers show 
that it is difficult to find a network that correctly classifies all examples in a 
training set. However, for learning to a particular accuracy it is only necessary 
to approximately solve this problem, that is, to find a network that correct- 
ly classifies most examples in a training set. In this paper, we show that this 
approximation problem is hard for several neural network classes. 

The hardness of PAC-style learning is a very natural question that has been
addressed from a variety of viewpoints. The strongest non-learnability conclusions
are those stating that no matter what type of algorithm a learner may use, as 
long as his computational resources are limited, he would not be able to predict 
a previously unseen label (with probability significantly better than that of a 
random guess). Such results have been derived by noticing that, in some pre- 
cise sense, learning may be viewed as breaking a cryptographic scheme. These 
strong hardness results are based upon assuming the security of certain cryp- 
tographic constructions (and in this respect are weaker than hardness results 



P. Fischer and H.U. Simon (Eds.): EuroCOLT’99, LNAI 1572, pp. 50-|^^ 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 






that are based on computational complexity assumptions like $\mathrm{P} \ne \mathrm{NP}$ or even
$\mathrm{RP} \ne \mathrm{NP}$). The weak side of these results is that they apply only to classes
that are rich enough to encode a cryptographic mechanism. For example, under
cryptographic assumptions, Goldreich, Goldwasser and Micali show that it is
difficult to learn boolean circuits over $n$ inputs with at most $p(n)$ gates, for some
polynomial $p$. Kearns and Valiant improve this result to circuits of polynomially
many linear threshold gates and some constant (but unknown) depth.
However, such classes are too rich to be considered useful for learning purposes.
Another line of research considers agnostic learning by natural hypothesis classes.
In such a learning setting, no assumptions are made about the rule used to label
the examples, and the learner is required to find a hypothesis in the class that
minimizes the labeling errors over the training sample. If such a hypothesis class
is relatively small (say, in terms of its VC-dimension), then it can be shown that
such a hypothesis will have a good prediction ability.

There are quite a few hardness results in this framework. The first type are
results showing hardness of finding a member of the hypothesis class that indeed
minimizes the number of misclassifications over a given labeled sample. Blum and
Rivest prove that it is NP-hard to decide if there is a two-layer linear threshold
network with only two hidden units that correctly classifies all examples in a
training sample. (Our main reduction uses an extension of the technique used by
Blum and Rivest.) They also show that finding a conjunction of $k$ linear threshold
functions that correctly classifies all positive examples and some constant
proportion of negative examples is as hard as coloring an $n$-vertex $k$-colorable
graph with $O(k \log n)$ colors (which has since been shown to be NP-hard).
DasGupta, Siegelmann and Sontag extend Blum and Rivest's results to two-layer
networks with piecewise linear hidden units. Megiddo shows that it is
NP-hard to decide if any boolean function of two linear threshold functions can
correctly classify a training sample.

The weakness of such results is that, for the purpose of learning, one can settle 
for approximating the best hypothesis in the class, while the hardness results 
apply only to exactly meeting the best possible error rate. 

Somewhat stronger are results showing the hardness of ‘robust learning’. A robust
learner should be able to find, for any given labeled sample, and for every
$\epsilon > 0$, a hypothesis with training error rate within $\epsilon$ of the best possible within
the class, in time polynomial in the sample size and in $1/\epsilon$. Höffgen and Simon
show that, assuming $\mathrm{RP} \ne \mathrm{NP}$, no such learner exists for some subclasses of
the class of halfspaces. Judd shows NP-hardness results for an approximate
sample error minimization problem for certain linear threshold networks with
many outputs.

One may argue that, for all practical purposes, a learner may be considered
successful once he finds a hypothesis that $\epsilon$-approximates the target (or the best
hypothesis in a given class) for some fixed small $\epsilon$. Such learning is not ruled out
by ruling out robust learning. 

We are therefore led to the next level of hardness-of-learning results, showing 
hardness of approximating the best fitting hypothesis in the class to within some 
fixed error rate. Arora, Babai, Stern and Sweedyk show that, for any constant,
it is NP-hard to find a linear threshold function that has the ratio of the number
of misclassifications to the optimum number below that constant. Höffgen and
Simon show a similar result. We extend this type of result to richer classes
of neural networks. 

The neural networks that we consider have two layers, with k linear threshold 
units in the first layer and a variety of output units. For pattern classification, 
we consider output units that compute boolean functions, and for real predic- 
tion we consider sigmoidal output units. Both problems can be expressed in a 
probabilistic setting, in which the training data is generated by some probability 
distribution, and we attempt to find a function that has near-minimal expected 
loss with respect to this distribution. For pattern classification,
we use the discrete loss; for real estimation, we use the quadratic loss.
In both cases, efficiently finding a network with expected loss nearly minimal is 
equivalent to efficiently finding a network that has the sample average of loss 
nearly minimal. In this paper, we give results that quantify the difficulty of these 
approximate sample error minimization problems. For the pattern classification 
problem, we show that it is NP-hard to find a network with k linear threshold 
units in the first layer and an output unit that computes a conjunction that has 
proportion of data correctly classified within c/k of optimal, for some constant 
$c$. We extend this result to two-layer linear threshold networks (that is, where
the output unit is also a linear threshold unit). In this case, the problem is hard
to approximate within $c/k^3$ for some constant $c$. This latter result applies even
when there is a network that correctly classifies all of the data.

The case of quadratic loss has also been studied recently. Jones considers
the problem of approximately minimizing the sample average of the quadratic 
loss over a class of two-layer networks with sigmoid units in the first layer and a 
linear output unit with constraints on the size of the output weights. He shows 
that this approximation problem is NP-hard, for approximation accuracies of 
order 1/m, where m is the sample size. The weakness of these results is that the 
approximation accuracy is sufficiently small to ensure that every single training 
example has small quadratic loss, a requirement that exceeds the sufficiency 
conditions needed to ensure valid generalization. Vu has used results on
hardness of approximations to improve Jones’ results. He shows that the problem 
of approximately minimizing the sample average of the quadratic loss of a two- 
layer network with k linear threshold hidden units and a linear output unit 
remains hard when the approximation error is as large as , where c 

is a constant and d is the input dimension. The hard samples in Vu’s result have 
size that grows polynomially with d, so once again, the approximation threshold 
is a decreasing function of m. 

In this paper, we also study the problem of approximately minimizing quadratic 
loss. We consider the class of two-layer networks with linear threshold units in 
the first layer and a sigmoid output unit (and no constraints on the output 
weights). We show that it is NP-hard to find such a network that has the sample
average of the quadratic loss within $c/k^3$ of its optimal value, for some constant
$c$. This result is true even when the infimum over all networks of the error on
the training data is zero. One should note that our results show hardness for an 
approximation value that is independent of input dimension and of the sample 
size. 

All of the learning problems studied in this paper can be solved efficiently if 
we fix the input dimension and the number of hidden units k. In that case, 
the algorithm ‘Splitting’ efficiently enumerates all
training set dichotomies computed by a linear threshold function. 

2 Approximate Optimization Definitions 

A maximization problem $A$ is defined as follows. Let $m_A$ be a non-negative
objective function. Given an input $x$, the goal is to find a solution $y$ for which the
objective function $m_A(x, y)$ is maximized. Define $\mathrm{opt}_A(x)$ as the maximum value
of the objective function. (We assume that, for all $x$, $m_A(x, \cdot)$ is not identically
zero, so that the maximum is positive.) The relative error of a solution $y$ is
defined as $(\mathrm{opt}_A(x) - m_A(x, y))/\mathrm{opt}_A(x)$.
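For concreteness, the relative error defined above can be computed as in the following sketch. The helper name `relative_error` and the example values are hypothetical; exact rational arithmetic is used only to avoid rounding in the illustration.

```python
from fractions import Fraction

def relative_error(opt_value, achieved):
    """(opt_A(x) - m_A(x, y)) / opt_A(x); requires opt_A(x) > 0."""
    if opt_value <= 0:
        raise ValueError("the optimum must be positive")
    return Fraction(opt_value - achieved) / Fraction(opt_value)

# Hypothetical instance: the best solution scores 9/10, ours scores 3/4.
err = relative_error(Fraction(9, 10), Fraction(3, 4))
```

Here the solution falls short of the optimum by one sixth of the optimal value.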

Our proofs use L-reductions, which preserve approximability. An
L-reduction from one optimization problem $A$ to another $B$ is a pair of
functions $f$ and $g$ that are computable in polynomial time and satisfy the following
conditions.

1. $f$ maps from instances of $A$ to instances of $B$.

2. There is a positive constant $\alpha$ such that, for all instances $x$ of $A$,
$\mathrm{opt}_B(f(x)) \le \alpha\, \mathrm{opt}_A(x)$.

3. $g$ maps from instances of $A$ and solutions of $B$ to solutions of $A$.

4. There is a positive constant $\beta$ such that, for instances $x$ of $A$ and all solutions
$y$ of $f(x)$, we have $\mathrm{opt}_A(x) - m_A(x, g(x, y)) \le \beta\, (\mathrm{opt}_B(f(x)) - m_B(f(x), y))$.

The following lemma is immediate from the definitions. 

Lemma 1. Let $A$ and $B$ be maximization problems. Suppose that it is NP-hard
to approximate $A$ with relative error less than $\delta$, and that $A$ L-reduces to $B$ with
constants $\alpha$ and $\beta$. Then it is NP-hard to approximate $B$ with relative error less
than $\delta/(\alpha\beta)$.

Clearly, this lemma remains true if we relax condition 4 of the L-reduction, so
that it applies only to solutions $y$ of an instance $f(x)$ that have relative error
less than $\delta/(\alpha\beta)$.
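Although the lemma is stated as immediate, the chain of inequalities behind it is short enough to write out. The following is a sketch that uses only conditions 2 and 4 of the L-reduction: suppose $y$ is a solution of $f(x)$ whose relative error for $B$ is less than $\delta/(\alpha\beta)$.

```latex
\begin{align*}
\mathrm{opt}_A(x) - m_A(x, g(x, y))
  &\le \beta \bigl( \mathrm{opt}_B(f(x)) - m_B(f(x), y) \bigr)
     && \text{(condition 4)} \\
  &< \beta \cdot \frac{\delta}{\alpha\beta} \, \mathrm{opt}_B(f(x))
     && \text{(relative error of } y \text{)} \\
  &\le \frac{\delta}{\alpha} \cdot \alpha \, \mathrm{opt}_A(x)
     = \delta \, \mathrm{opt}_A(x)
     && \text{(condition 2)}
\end{align*}
```

So $g(x, y)$ would be a solution of $A$ with relative error less than $\delta$, and a polynomial-time approximation algorithm for $B$ would yield one for $A$.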

For all of the problems studied in this paper, we define the objective function
such that $\max_x \mathrm{opt}_A(x) = 1$. With this normalization condition, we say that
an L-reduction preserves maximality if $\mathrm{opt}_A(x) = 1$ implies $\mathrm{opt}_B(f(x)) = 1$.
(This is a special case of Petrank's notion of preserving the ‘gap location’ in
reductions between optimization problems.) The following lemma is also trivial.

Lemma 2. Let $A$ and $B$ be maximization problems. Suppose that it is NP-hard
to approximate $A$ with relative error less than $\delta$, even for instances with
$\mathrm{opt}_A(x) = 1$. If $A$ L-reduces to $B$ with constants $\alpha$ and $\beta$, and the L-reduction
preserves maximality, then it is NP-hard to approximate $B$ with relative error
less than $\delta/(\alpha\beta)$, even for instances with $\mathrm{opt}_B(x) = 1$.

3 Results 

We first consider two-layer networks with $k$ linear threshold units in the first
layer and an output unit that computes a conjunction. These networks compute
functions of the form $f(x) = \bigwedge_{i=1}^{k} f_i(x)$, where each $f_i$ is a linear threshold
function of the form $f_i(x) = \mathrm{sgn}(w_i \cdot x - \theta_i)$ for some $w_i \in \mathbb{R}^n$, $\theta_i \in \mathbb{R}$. Here,
$\mathrm{sgn}(\alpha)$ is 1 if $\alpha > 0$ and 0 otherwise. Let $N_k^{\wedge}$ denote this class of functions.

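Max $k$-And Consistency.

Input: A sequence $S$ of labelled examples, $(x_i, y_i) \in \{0, 1\}^n \times \{0, 1\}$.

Goal: Find a function $f$ in $N_k^{\wedge}$ (the class of conjunctions of $k$ linear threshold
functions) that maximizes the proportion of
consistent examples, $(1/m)\, |\{i : f(x_i) = y_i\}|$.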
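The objective of this problem is easy to evaluate for a given network. The sketch below is an illustration under the definitions above, with the threshold convention $\mathrm{sgn}(\alpha) = 1$ for $\alpha > 0$; the function names and the toy sample are assumptions made here.

```python
def sgn(value):
    """Threshold used in the text: 1 if value > 0, else 0."""
    return 1 if value > 0 else 0

def and_network(weights, thresholds, x):
    """f(x) = AND_i sgn(w_i . x - theta_i) for k hidden units."""
    for w, theta in zip(weights, thresholds):
        activation = sum(wi * xi for wi, xi in zip(w, x)) - theta
        if sgn(activation) == 0:
            return 0
    return 1

def consistency(weights, thresholds, sample):
    """Proportion (1/m) |{i : f(x_i) = y_i}| of consistent examples."""
    hits = sum(1 for x, y in sample
               if and_network(weights, thresholds, x) == y)
    return hits / len(sample)

# Two hidden units that together compute x_1 AND x_2 on binary inputs.
weights = [(1, 0), (0, 1)]
thresholds = [0.5, 0.5]
sample = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1)]
score = consistency(weights, thresholds, sample)
```

The hardness results concern finding weights that make this score nearly maximal, not evaluating it.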

The condition $\mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S) = 1$ in the following theorem corresponds to
the case in which the training sample is consistent with some function in the
class $N_k^{\wedge}$ of conjunctions of $k$ linear threshold functions.

Theorem 1. Suppose $k \ge 3$. It is NP-hard to approximate Max $k$-And
Consistency with relative error less than $1/(238k)$. Furthermore, there is a constant
$c$ such that even when $\mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S) = 1$ it is NP-hard to approximate
Max $k$-And Consistency with relative error less than $c/k^2$.

The class $N_k^{\wedge}$ is somewhat unnatural, since the output unit is constrained to
compute a conjunction. Let $F$ be a set of boolean functions on $k$ inputs, and let
$N_k^F$ denote the class of functions of the form $f(x) = g(f_1(x), \ldots, f_k(x))$, where
$g \in F$ and $f_1, \ldots, f_k$ are linear threshold functions.

We do not know how to extend Theorem 1 to give a corresponding hardness result
for the class $N_k^F$ defined on binary inputs. However, we can obtain results of
this form if we allow rational inputs.

Max $k$-$F$ Consistency.

Input: A sequence $S$ of labelled examples, $(x_i, y_i) \in \mathbb{Q}^n \times \{0, 1\}$.

Goal: Find a function $f$ in $N_k^F$ (functions $g(f_1(x), \ldots, f_k(x))$ with $g \in F$)
that maximizes the proportion of
consistent examples, $(1/m)\, |\{i : f(x_i) = y_i\}|$.

Theorem 2. For $k \ge 3$, there is a constant $c$ such that for any class $F$ of
boolean functions containing the conjunction, it is NP-hard to approximate Max
$k$-$F$ Consistency with relative error less than $c/k^3$, even for instances with
$\mathrm{opt}_{\text{Max } k\text{-}F \text{ Consistency}}(S) = 1$.

Next we consider the class of two-layer networks with linear threshold units in
the first layer and a sigmoid output unit. That is, we consider the class $N_k^{\sigma}$ of
real-valued functions of the form

$$f(x) = \sigma\left(\sum_{i=1}^{k} v_i f_i(x) + v_0\right),$$

where $v_i \in \mathbb{R}$, $f_1, \ldots, f_k$ are linear threshold functions, and $\sigma : \mathbb{R} \to \mathbb{R}$ is a
fixed function. We require that the fixed function $\sigma$ maps to the interval $[0, 1]$,
is monotonically non-decreasing, and satisfies

$$\lim_{\alpha \to -\infty} \sigma(\alpha) = 0, \qquad \lim_{\alpha \to \infty} \sigma(\alpha) = 1.$$

(The limits 0 and 1 here can be replaced by any two distinct numbers.)

Max $k$-$\sigma$ Consistency.

Input: A sequence $S$ of labelled examples, $(x_i, y_i) \in \mathbb{Q}^n \times ([0, 1] \cap \mathbb{Q})$.
Goal: Find a function $f$ in $N_k^{\sigma}$ that maximizes $1 - (1/m) \sum_{i=1}^{m} (y_i - f(x_i))^2$.

Theorem 3. For $k \ge 3$, there is a constant $c$ such that it is NP-hard to
approximate Max $k$-$\sigma$ Consistency with relative error less than $c/k^3$, even for
samples with $\mathrm{opt}_{\text{Max } k\text{-}\sigma \text{ Consistency}}(S) = 1$.
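To make the quadratic-loss objective concrete, here is a minimal sketch of a network with threshold hidden units and a sigmoid output, together with the quantity being maximized. The logistic function is only one admissible choice of $\sigma$ (an assumption for illustration), and all names and the toy sample are hypothetical.

```python
import math

def logistic(a):
    """One admissible sigma: monotone, maps into [0, 1], limits 0 and 1."""
    return 1.0 / (1.0 + math.exp(-a))

def threshold_unit(w, theta, x):
    """Linear threshold unit: 1 if w . x - theta > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta > 0 else 0

def sigma_network(hidden, out_weights, bias, x, sigma=logistic):
    """f(x) = sigma(sum_i v_i f_i(x) + v_0); `hidden` holds (w_i, theta_i)."""
    total = bias + sum(v * threshold_unit(w, th, x)
                       for v, (w, th) in zip(out_weights, hidden))
    return sigma(total)

def quadratic_objective(hidden, out_weights, bias, sample):
    """1 - (1/m) sum_i (y_i - f(x_i))^2."""
    loss = sum((y - sigma_network(hidden, out_weights, bias, x)) ** 2
               for x, y in sample)
    return 1.0 - loss / len(sample)

hidden = [((1.0,), 0.5)]            # a single threshold unit on one input
sample = [((0.0,), 0.0), ((1.0,), 1.0)]
value = quadratic_objective(hidden, [40.0], -20.0, sample)
```

With large output weights the objective approaches 1 on this sample, which illustrates why the theorem speaks of the infimum of the error being zero rather than of an exactly consistent network.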

4 Reductions 

4.1 Learning with an AND output unit: Max $k$-And Consistency
We give an L-reduction from Max $k$-Cut.

Max $k$-Cut.

Input: A graph $G = (V, E)$.

Goal: Find a colour assignment $c : V \to [k]$ that maximizes the proportion
of multicoloured edges, $(1/|E|)\, |\{(v_1, v_2) \in E : c(v_1) \ne c(v_2)\}|$.
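The Max $k$-Cut objective can be evaluated, and for very small graphs maximized, by exhaustive search. The sketch below is only an illustration of the definition (the problem is NP-hard in general); colours are numbered from 0 rather than 1, and the function names are assumptions.

```python
from itertools import product

def cut_proportion(edges, colouring):
    """Proportion of edges whose endpoints receive different colours."""
    multi = sum(1 for u, v in edges if colouring[u] != colouring[v])
    return multi / len(edges)

def opt_max_k_cut(num_vertices, edges, k):
    """Exhaustive search over all k^n colourings (tiny graphs only)."""
    return max(cut_proportion(edges, c)
               for c in product(range(k), repeat=num_vertices))

# The complete graph on 4 vertices is not 3-colourable:
# at best 5 of its 6 edges are multicoloured.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
best = opt_max_k_cut(4, k4, 3)
```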

We use the following result, due to Kann, Khanna, Lagergren, and Panconesi,
to prove the first part of Theorem 1.

Theorem 4. For $k \ge 2$, it is NP-hard to approximate Max $k$-Cut with
relative error less than $1/(34(k - 1))$.

For the second part of the theorem, we need a similar hardness result for
$k$-colourable graphs. The following result is essentially due to Petrank [18].
Theorem 3.3 in [18] gives the hardness result without calculating the dependence
of the gap on $k$. Using the reduction due to Papadimitriou and Yannakakis
that Petrank uses in the final step of his proof shows that this dependence is of
the form $c/k^2$.

Theorem 5 ([18]). For $k \ge 3$, there is a constant $c$ such that it is NP-hard to
approximate Max $k$-Cut with relative error less than $c/k^2$, even for $k$-colourable
graphs.
Given a graph $G = (V, E)$, we construct a sample $S$ for a Max $k$-And
Consistency problem using a technique similar to that used by Blum and Rivest.
The key difference is that we use multiple copies of certain points in the training
sample, in order to preserve approximability.

Suppose $|V| = n$, and relabel $V = \{v_1, \ldots, v_n\} \subset \{0, 1\}^n$, where $v_i$ is the unit
vector with a 1 in position $i$ and 0s elsewhere. Let $S$ consist of

- $a$ copies of $(0, 1)$ (where $0$ is the all-0 vector in $\{0, 1\}^n$),

- $|\{v \in V : (v_i, v) \in E\}|$ copies of $(v_i, 0)$, for each $v_i \in V$, and

- one copy of $(v_i + v_j, 1)$, for each $(v_i, v_j) \in E$.

The number $a$ will be determined shortly. Clearly, $|S| = a + 3|E|$.
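The construction of $S$ from $G$ can be sketched as follows. This is an illustration under the assumptions that each undirected edge is listed once and that there are no self-loops; vertices are indexed from 0, and the name `build_sample` is hypothetical.

```python
def build_sample(num_vertices, edges, a):
    """Return the labelled sample S as a list of (point, label) pairs."""
    def unit(i):
        return tuple(1 if j == i else 0 for j in range(num_vertices))

    zero = tuple(0 for _ in range(num_vertices))
    sample = [(zero, 1)] * a                 # a copies of (0, 1)
    for u, v in edges:
        # One copy of (v_i, 0) per incident edge, for both endpoints,
        # so each vertex appears as often as its degree.
        sample.append((unit(u), 0))
        sample.append((unit(v), 0))
        # One copy of (v_i + v_j, 1) for the edge itself.
        both = tuple(x + y for x, y in zip(unit(u), unit(v)))
        sample.append((both, 1))
    return sample

triangle = [(0, 1), (1, 2), (0, 2)]
S = build_sample(3, triangle, a=10)
```

For the triangle with $a = 10$ the sample has $a + 3|E| = 19$ examples.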

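The proof of Theorem 1 relies on the following two lemmas.

Lemma 3. For $k \ge 2$, $\mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S) \le (k/(k - 1))\, \mathrm{opt}_{\text{Max } k\text{-Cut}}(G)$.
Furthermore, if $\mathrm{opt}_{\text{Max } k\text{-Cut}}(G) = 1$ then $\mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S) = 1$.

Proof. Let $c$ be the optimal colouring of $V$. Define hidden unit $i$ as $f_i(x) =
\mathrm{sgn}(w_i \cdot x - \theta_i)$, where $\theta_i = -1/2$ and $w_i = (w_{i,1}, \ldots, w_{i,n}) \in \mathbb{R}^n$ is such that $w_{i,j}$
takes value $-1$ if $c(v_j) = i$ and 1 otherwise. Clearly, the $a$ copies of $(0, 1)$ are
correctly classified. It is easy to verify that each $(v_i, 0)$ is correctly classified.
Finally, every labelled example $(v_i + v_j, 1)$ corresponding to an edge $(v_i, v_j) \in E$
has

$$f_l(v_i + v_j) = \begin{cases} 0 & \text{if } c(v_i) = c(v_j) = l \\ 1 & \text{otherwise,} \end{cases}$$

for $l = 1, \ldots, k$. Hence,

$$\mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S) = \frac{a + 2|E| + |E|\, \mathrm{opt}_{\text{Max } k\text{-Cut}}(G)}{a + 3|E|} \tag{1}$$

But it is easy to show that every graph has $\mathrm{opt}_{\text{Max } k\text{-Cut}}(G) \ge 1 - 1/k$. (A random
colouring has this expected proportion of multicoloured edges.) Hence, we can
approximate the linear relationship by

$$\begin{aligned}
\mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S)
 &\le \left( \frac{a + 2|E|}{(a + 3|E|)(1 - 1/k)} + \frac{|E|}{a + 3|E|} \right) \mathrm{opt}_{\text{Max } k\text{-Cut}}(G) \\
 &= \left( \frac{k}{k - 1} - \frac{|E|}{(k - 1)(a + 3|E|)} \right) \mathrm{opt}_{\text{Max } k\text{-Cut}}(G) \\
 &\le \frac{k}{k - 1}\, \mathrm{opt}_{\text{Max } k\text{-Cut}}(G).
\end{aligned}$$

Finally, it is clear from (1) that $\mathrm{opt}_{\text{Max } k\text{-Cut}}(G) = 1$ implies
$\mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S) = 1$.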
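The weight construction in this proof can be checked on a small graph. The sketch below is illustrative only: it builds the hidden units from a given colouring (colours numbered from 0) and counts which example types the resulting conjunction classifies correctly. The graph, colouring and function names are assumptions.

```python
def units_from_colouring(colouring, k):
    """Unit i has w_ij = -1 if vertex j has colour i, else +1; theta_i = -1/2."""
    weights = [[-1 if c == i else 1 for c in colouring] for i in range(k)]
    thresholds = [-0.5] * k
    return weights, thresholds

def and_net(weights, thresholds, x):
    """Conjunction of threshold units, with sgn(z) = 1 iff z > 0."""
    return int(all(sum(w * xi for w, xi in zip(row, x)) - th > 0
                   for row, th in zip(weights, thresholds)))

def unit(i, n):
    return [1 if j == i else 0 for j in range(n)]

n, k = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # 4-cycle plus a chord
colouring = [0, 1, 0, 1]                           # only (0, 2) is monochromatic
W, T = units_from_colouring(colouring, k)

origin_ok = and_net(W, T, [0] * n) == 1
vertices_ok = all(and_net(W, T, unit(i, n)) == 0 for i in range(n))
edge_hits = sum(
    and_net(W, T, [p + q for p, q in zip(unit(u, n), unit(v, n))]) == 1
    for u, v in edges)
```

The origin and all vertex examples are classified correctly, and exactly the four multicoloured edges yield correctly classified edge examples, in line with (1).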



Lemma 4. Set $a = 3|E| + 1$ and suppose that $k \ge 3$. For any Max $k$-And
Consistency($S$) solution $f$ with relative error less than $1/(238k)$, we can find
in polynomial time a Max $k$-Cut($G$) solution $g$ with cost $c_g$, and

$$\mathrm{opt}_{\text{Max } k\text{-Cut}}(G) - c_g \le 7 \left( \mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S) - c_f \right).$$
Proof. Given a Max $k$-And Consistency($S$) solution $f$, if

$$c_f > \frac{3|E|}{a + 3|E|}, \tag{2}$$

then we know that $f(0) = 1$. To find a suitable choice for $a$, recall that
$\mathrm{opt}_{\text{Max } k\text{-Cut}}(G) \ge 1 - 1/k$, so (1) implies $\mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S) \ge 1 - 1/k$. So
if the relative error of $c_f$ is less than $1/(238k)$, we have

$$c_f > \left(1 - \frac{1}{k}\right)\left(1 - \frac{1}{238k}\right) \ge 1 - \frac{239}{238k}.$$

Choosing $a = 3|E| + 1$ means that (2) will be true for $k \ge 3$.
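The claim that this choice of $a$ suffices for $k \ge 3$ comes down to comparing two fractions, which the following sketch checks with exact arithmetic. The function name `condition_2_holds` is hypothetical; the check also shows why $k = 2$ is excluded.

```python
from fractions import Fraction

def condition_2_holds(k, num_edges):
    """Check 1 - 239/(238k) > 3|E| / (a + 3|E|) with a = 3|E| + 1."""
    a = 3 * num_edges + 1
    lower_bound_on_cf = 1 - Fraction(239, 238 * k)
    threshold = Fraction(3 * num_edges, a + 3 * num_edges)
    return lower_bound_on_cf > threshold

ok_for_k3 = all(condition_2_holds(3, e) for e in range(1, 500))
```

The threshold $3|E|/(6|E| + 1)$ is always below $1/2$, while $1 - 239/(238k)$ exceeds $1/2$ exactly when $k \ge 3$; for $k = 2$ the comparison fails on large graphs.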

Suppose that $f = \bigwedge_{l=1}^{k} f_l$. Define a colouring $g$ as follows: if $f(v_i) = 1$, set
$g(v_i) = 1$; otherwise set $g(v_i) = \min\{j : f_j(v_i) = 0\}$. For an edge $(v_i, v_j)$, if the
edge is monochromatic (that is, $g(v_i) = g(v_j)$), then $f(v_i) = f(v_j) = 0$ implies
$f(v_i + v_j) = 0$. To see this, suppose that $f(v_i) = 0$ and $f(v_j) = 0$. Then
$g(v_i) = g(v_j)$ implies some $l$ has $f_l(v_i) = f_l(v_j) = 0$. But since we also have
$f(0) = 1$, we must have $f(v_i + v_j) = 0$: indeed,
$w_l \cdot (v_i + v_j) - \theta_l = (w_l \cdot v_i - \theta_l) + (w_l \cdot v_j - \theta_l) + \theta_l$, where the two
bracketed terms make unit $l$ output 0 and $f_l(0) = 1$ rules out a positive $\theta_l$, so
unit $l$ also outputs 0 on $v_i + v_j$.
For each example in $S$ that corresponds to a vertex $v \in V$, label the multiple
copies with the pairs $(v, e)$ for all edges $e \in E$ of the form $e = (v, v')$. Then
there is a 1-1 mapping between edges $e = (v_1, v_2)$ and triples $((v_1, e), (v_2, e), e)$.
It follows that

$$|E| \left( \mathrm{opt}_{\text{Max } k\text{-Cut}}(G) - c_g \right) \le |S| \left( \mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S) - c_f \right),$$

and with the choice of $a$ above, this is equivalent to

$$\begin{aligned}
\mathrm{opt}_{\text{Max } k\text{-Cut}}(G) - c_g
 &\le \frac{6|E| + 1}{|E|} \left( \mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S) - c_f \right) \\
 &\le 7 \left( \mathrm{opt}_{\text{Max } k\text{-And Consistency}}(S) - c_f \right).
\end{aligned}$$

Hence, we have an L-reduction from Max $k$-Cut to Max $k$-And Consistency,
with parameters $\alpha = k/(k - 1)$ and $\beta = 7$, and this L-reduction preserves
maximality. Together with Theorems 4 and 5, this implies Theorem 1.
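The map from a network $f$ back to a colouring $g$ used in this proof is straightforward to compute. The sketch below is an illustration with hypothetical names; it uses the convention $\mathrm{sgn}(z) = 1$ for $z > 0$, so a unit outputs 0 on the unit vector $v_i$ exactly when $w_{j,i} - \theta_j \le 0$. Colours are numbered from 1, as in the text.

```python
def extract_colouring(weights, thresholds, num_vertices):
    """g(v_i) = 1 if f(v_i) = 1, else the least j with f_j(v_i) = 0."""
    colouring = []
    for i in range(num_vertices):
        colour = 1                       # default when every unit outputs 1
        for j, (row, theta) in enumerate(zip(weights, thresholds), start=1):
            # On the unit vector v_i the activation is just w_{j,i} - theta_j.
            if row[i] - theta <= 0:
                colour = j
                break
        colouring.append(colour)
    return colouring

# Units built from the colouring (1, 2, 1) as in the proof of Lemma 3.
weights = [[-1, 1, -1], [1, -1, 1]]
thresholds = [-0.5, -0.5]
g = extract_colouring(weights, thresholds, 3)
```

Applied to the units constructed from a colouring, the map returns that colouring.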



4.2 Learning with an arbitrary output unit: Max $k$-$F$ Consistency

We use the reduction from the proof of Theorem 1 and augment the input
with two extra, rational, components, which we use to force the output unit to
compute a conjunction. For $V = \{v_1, \ldots, v_n\} \subset \{0, 1\}^n$, we let $S$ consist of the
following labelled points from $(\mathbb{Q}^2 \times \{0, 1\}^n) \times \{0, 1\}$.

- $a$ copies of $((0, 0), 1)$,

- $|\{v \in V : (v_i, v) \in E\}|$ copies of $((0, v_i), 0)$, for each $v_i \in V$,

- one copy of $((0, v_i + v_j), 1)$, for each $(v_i, v_j) \in E$,

- $a$ copies of $((s, 0), 1)$ for $s \in S_{in} \subset \mathbb{Q}^2$, and

- $a$ copies of $((s, 0), 0)$ for $s \in S_{out} \subset \mathbb{Q}^2$,

where the sets $S_{in}$ and $S_{out}$ and the number $a$ will be defined shortly. Here, the
first three types of labelled examples are those used in the reduction described in
the proof of Theorem 1, augmented with two rational inputs set to zero. The sets
$S_{in}$ and $S_{out}$ both have cardinality $3k$. Each point in $S_{in}$ is paired with a point
in $S_{out}$, and this pair straddles some edge of a regular $k$-sided polygon in $\mathbb{R}^2$
that has vertices on the unit circle centred at the origin, as shown in Figure 1.

Fig. 1. The sets $S_{in}$ and $S_{out}$ used in the proof of Theorem 2, for the case $k = 5$.
The points in $S_{in}$ are marked as crosses; those in $S_{out}$ are marked as circles.



(We call this pair of points a ‘straddling pair’.) The midpoint of each pair lies on
some edge of the polygon, and the line passing through the pair is perpendicular
to that edge. The set of $3k$ midpoints (one for each pair) and the $k$ vertices
of the polygon are equally spaced around the polygon. We use the weights and
thresholds defined in the proof of Lemma 3, augmented with appropriate weights
for the two additional inputs. We choose the output unit as a conjunction and
arrange the new hidden unit weights so that the intersection of the hidden unit
decision boundaries with the plane of the two additional inputs coincides with
the $k$ sides of the polygon. It is now easy to verify the following lemma. The
proof is essentially identical to that of Lemma 3.

Lemma 5. For $k \ge 3$, $\mathrm{opt}_{\text{Max } k\text{-}F \text{ Consistency}}(S) \le (k/(k - 1))\, \mathrm{opt}_{\text{Max } k\text{-Cut}}(G)$.
Furthermore, if $\mathrm{opt}_{\text{Max } k\text{-Cut}}(G) = 1$ then $\mathrm{opt}_{\text{Max } k\text{-}F \text{ Consistency}}(S) = 1$.
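The straddling pairs described above can be generated as in the following sketch. Several details are assumptions made for illustration: polygon vertices are placed at angles $2\pi j/k$, the three midpoints on each edge sit at one quarter, one half and three quarters of the edge (so that midpoints and vertices are equally spaced along the perimeter), and `offset` is a hypothetical parameter for the distance of each point from its edge. Floating-point coordinates stand in for the rational coordinates required by the reduction.

```python
import math

def straddling_pairs(k, offset):
    """Return (s_in, s_out): 3k points each, paired across the k-gon's edges."""
    verts = [(math.cos(2 * math.pi * j / k), math.sin(2 * math.pi * j / k))
             for j in range(k)]
    s_in, s_out = [], []
    for j in range(k):
        (x0, y0), (x1, y1) = verts[j], verts[(j + 1) % k]
        # The outward unit normal of an edge of a regular polygon centred
        # at the origin points towards the midpoint of that edge.
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        norm = math.hypot(mx, my)
        nx, ny = mx / norm, my / norm
        for t in (0.25, 0.5, 0.75):
            px, py = x0 + t * (x1 - x0), y0 + t * (y1 - y0)
            s_in.append((px - offset * nx, py - offset * ny))
            s_out.append((px + offset * nx, py + offset * ny))
    return s_in, s_out

s_in, s_out = straddling_pairs(5, 0.01)
```

Each inner point and its outer partner lie on a line perpendicular to their edge, with the edge passing through their midpoint.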

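Lemma 6. Set $a = 6|E|$ and suppose that $k \ge 2$. If $\mathrm{opt}_{\text{Max } k\text{-}F \text{ Consistency}}(S) = 1$,
then for any Max $k$-$F$ Consistency solution $f$ with relative error less than
$1/(13k)$, we can find in polynomial time a Max $k$-Cut($G$) solution $g$ with cost
$c_g$, and

$$\mathrm{opt}_{\text{Max } k\text{-Cut}}(G) - c_g \le 9(4k + 1) \left( \mathrm{opt}_{\text{Max } k\text{-}F \text{ Consistency}}(S) - c_f \right).$$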
Proof. For a Max $k$-$F$ Consistency($S$) solution $f$, if

$$1 - c_f < \frac{a}{(6k+1)a + 3|E|}, \qquad (3)$$

then we know that $f(0,0) = 1$ and $f(s,0)$ is 1 for $s \in S_{in}$ and 0 for $s \in S_{out}$.
Condition (3) is equivalent to

$$a > \frac{3|E|(1 - c_f)}{1 - (1 - c_f)(6k+1)},$$

which follows from $a = 6|E|$ when the denominator is greater than 1/2. The
latter is equivalent to $c_f > 1 - 1/(2(6k+1))$, and this is true when the relative
error condition of the lemma is satisfied. So, under the conditions of the lemma,
the Max $k$-$F$ Consistency($S$) solution $f$ correctly classifies the origin and the
points straddling the polygon. Let $\delta$ denote the distance between a point in
$S_{in} \cup S_{out}$ and its associated edge. Clearly, since the points in $\{(s, 0) : s \in S_{in}\}$
are labelled 1 and those in $\{(s, 0) : s \in S_{out}\}$ are labelled 0, for every straddling
pair described above, any function in the class that is consistent with these points
has some hidden unit whose decision boundary separates the pair. It is easy
to show using elementary trigonometry that there is a constant $c$ such that, if
$\delta < c/k$, no line in the plane can pass between more than three of these pairs, and
no line can pass between three unless they all straddle the same edge of the
polygon. Since $k$ lines must separate $3k$ straddling pairs, and the origin must
be classified as 1, any function in the class that is consistent with the points from
$S_{in} \cup S_{out}$ is a conjunction of $k$ linear threshold functions.

We continue in the same way as the proof of Lemma 0 obtaining 

$$\mathrm{opt}_{\text{Max } k\text{-Cut}}(G) - c_g \le \frac{|S|}{|E|}\left(\mathrm{opt}_{\text{Max } k\text{-}F \text{ Consistency}}(S) - c_f\right).$$

Substituting $|S| = a(6k+1) + 3|E| = 9|E|(4k+1)$ gives the result.
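
As a quick check of this substitution, the arithmetic can be written out step by step. The only input is the choice $a = 6|E|$ from the statement of Lemma 6; dividing the result by $|E|$ gives the factor $9(4k+1)$ that appears in that lemma.

```latex
\begin{align*}
|S| &= a(6k+1) + 3|E| \\
    &= 6|E|(6k+1) + 3|E| \\
    &= |E|\,(36k + 9) \\
    &= 9|E|(4k+1),
\qquad\text{so}\qquad \frac{|S|}{|E|} = 9(4k+1).
\end{align*}
```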

It is easy to show that the result is also true if the components of vectors in Sin 
and Sout must be ratios of integers, provided the integers are allowed to be as 
large as cA:^, for some constant c. Hence, for each k, the number of bits needed 
to represent S is linear in the size of the graph G. 



These lemmas show that we have an L-reduction from Max $k$-Cut for $k$-colourable
graphs to Max $k$-$F$ Consistency, with parameters $\alpha = k/(k-1)$
and $\beta = 9(4k+1)$, and this L-reduction preserves maximality. Combining this
with Theorem ^ gives Theorem |21 



4.3 Learning with a sigmoid output unit: Max $k$-$\sigma$ Consistency

We give an L-reduction from Max $k$-$F$ Consistency to Max $k$-$\sigma$ Consistency,
where $F$ is the class of linear threshold functions. Given a sample
$S$ for a Max $k$-$F$ Consistency problem, we use the same sample for the
Max $k$-$\sigma$ Consistency problem. Trivially, if $\mathrm{opt}_{\text{Max } k\text{-}F \text{ Consistency}}(S) = 1$ then
$\mathrm{opt}_{\text{Max } k\text{-}\sigma \text{ Consistency}}(S) = 1$. Furthermore, we have the following lemma.



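Lemma 7. For a solution $f$ to Max $k$-$\sigma$ Consistency with cost $c_f$, we can
find a solution $h$ for Max $k$-$F$ Consistency with cost $c_h$, and

$$\mathrm{opt}_{\text{Max } k\text{-}F \text{ Consistency}}(S) - c_h \le \frac{1}{4}\left(\mathrm{opt}_{\text{Max } k\text{-}\sigma \text{ Consistency}}(S) - c_f\right).$$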



Proof. Suppose that

$$f(x) = \sigma\left(\sum_{i=1}^{k} v_i f_i(x) + v_0\right).$$

Without loss of generality, assume that $\sigma(0) = 1/2$. (In any case, adjusting
$v_0$ gives a function $\sigma$ that satisfies $\inf\{a : \sigma(a) > 1/2\} = 0$, which suffices for
the proof.) Now, if we replace $\sigma(\cdot)$ by $\mathrm{sgn}(\cdot)$, we obtain a function $h$ for which
$h(x_i) \ne y_i$ implies $(f(x_i) - y_i)^2 \ge 1/4$. It follows that $1 - c_h \le (1 - c_f)/4$, as
required.




Thus, for the case $\mathrm{opt}_{\text{Max } k\text{-}F \text{ Consistency}}(S) = 1$, we have an L-reduction from Max
$k$-$F$ Consistency to Max $k$-$\sigma$ Consistency, with parameters $\alpha = 1$ and
$\beta = 1/4$, and this L-reduction preserves maximality. Theorem 0 follows from
Theorem 0 



5 Future Work 

It seems likely that the relative error bounds in Theorems 0 and 0 can be im- 
proved to c/k"^. This would be immediate if Theorem 0 were also true for k- 
colourable graphs. 

It would be interesting to extend the hardness result for networks with real 
outputs to the case of a linear output unit with a constraint on the size of 
the output weights. We conjecture that a similar result can be obtained, with a 
relative error bound that — unlike Vu’s result for this case m — does not decrease 
as the input dimension increases. 

It would also be worthwhile to extend the results to show that it is difficult 
to find a hypothesis that has expected loss nearly minimal over some neural 
network class, whatever hypothesis class is used. There is some related work in 
this direction. Theorem 7 in 0 shows that finding a conjunction of k' linear 
threshold functions that correctly classifies a set that can be correctly classified 
by a conjunction of k linear threshold functions is as hard as colouring a k- 
colourable graph with n vertices using k' colours, which has since been shown 
to be hard for k' = 0{kn‘^) for some a > 0 m The cryptographic results 

^ In this problem, the maximum might not exist since the restriction of the func- 
tion class to the set of training examples is infinite, so we consider the problem of 
approximating the supremum. 
mentioned in Section Q do not have such strong restrictions on the hypothesis
class, but only apply to classes that are apparently considerably richer than the 
neural network classes studied in this paper. 
Abstract. We consider the following classes of quantified formulas. Fix
a set of basic relations called a basis. Take conjunctions of these basic 
relations applied to variables and constants in arbitrary ways. Finally, 
quantify existentially or universally some of the variables. We introduce 
some conditions on the basis that guarantee efficient learnability. Fur- 
thermore, we show that with certain restrictions on the basis the classifi- 
cation is complete. We introduce, as an intermediate tool, a link between 
this class of quantified formulas and some well-studied structures in U- 
niversal Algebra called clones. More precisely, we prove that the com- 
putational complexity of the learnability of these formulas is completely 
determined by a simple algebraic property of the basis of relations, their 
clone of polymorphisms. Finally, we use this technique to give a simpler 
proof of the already known dichotomy theorem over boolean domains 
and we present an extension of this theorem to bases with infinite size. 



1 Introduction 

The problem of learning an unknown formula under some determined protocol 
has been widely studied. The inevitable trade-off between the expressive pow- 
er of a family of formulas and the resources needed to learn them has forced 
researchers to study restricted classes of formulas. Among them, propositional 
formulas have received particular attention. It is known that learning general 
propositional formulas is hard fllTT] in the usual learning models and some effi- 
ciently learnable subclasses of boolean formulas, especially inside CNF and DNF, 
have been identified (see 1 1 f 1 2\ for example). 

First-order logic is a formalism with superior expressive power, but it is not so 
well studied from the computational point of view (see fH] and further references
in that paper). An active line of research in predicate logics is, for instance,
Inductive Logic Programming (ILP) [ 2012 1 ) .

In a recent paper, Dalmau, inspired by a well known dichotomy on the
satisfiability of boolean formulas proved by Schaefer, introduced a framework
to study the learnability of quantified boolean formulas and proved a complete
classification for finite bases. The main goal of this paper is to continue
that line of research by extending those results to domains of arbitrary size, since
quantification makes more sense when it can be applied to arbitrary variables
and not necessarily to boolean ones.

As a main intermediate tool for our study we introduce a link with some well 
known algebraic structures, called clones in Universal Algebra. This approach 
has been introduced in a different context by Jeavons et al. uni and it has been
successful in the study of the Constraint Satisfaction Problem (CSP) [13, 14, 15, 16].
We prove that the learning complexity of a family of quantified formulas over a 
finite domain is completely determined by its clone of polymorphisms. 

As a first application of this new technique we introduce two families of efficiently 
learnable classes, namely, coset generating (CG) and near-unanimity (NU) bases 
containing, as a particular case, the learnable classes for the boolean domain. 
Furthermore, we provide some evidence that these families of learnable formulas 
are complete. More precisely, we show that if we restrict the formulas in certain 
ways we obtain a dichotomy. Despite the fact that there exists a dichotomic 
classification for the boolean domain, a full dichotomy for arbitrary domains is 
not known and seems likely to be hard to find, since the clone lattice that
characterizes the learnability of quantified formulas is rather involved; actually 
it is uncountable. For the boolean domain, the clone lattice is simpler and has 
been completely characterized by Post m This description allows us to give 
an alternative proof of the dichotomy theorem in [Z] and to extend it to infinite 
bases. 

The positive learnability results are obtained using an apparently simple algo- 
rithm, called the Generating Set (GS) Algorithm. This algorithm exploits the 
intersection closure property of some representation classes. Learnable classes 
are efficiently learnable with equivalence queries in the model of exact learning 
with queries, as defined by Angluin. This fact is rather striking since in all the
dichotomic classifications the rest of the classes are shown not to be learnable 
even in the more powerful model of PAC-prediction with membership queries 
as defined by Angluin and Kharitonov |3| . Another characteristic feature of the 
learnable classes is that every concept in them can be described as the mini- 
mum concept containing a set of examples with size polynomial in the number 
of attributes (variables). This fact has some interesting consequences. First, the 
computational complexity of learning these classes only depends on the number 
of attributes but not on the length of the particular representation, strengthening
the dichotomy. Second, the total number of concepts with a determined
number of attributes is singly exponential in the number of attributes.
Because of space limitations some of the proofs are not included. We refer the reader to
the full-version paper 0 for proofs missing in this paper and for further technical
details.

2 Formulas and Relations 

Let $V = \{x_1, x_2, \ldots\}$ be an infinite set of variables. Let $D$ be a fixed finite set
called the domain. An assignment is a vector in $D^*$. If $x$ is a string, $|x|$ denotes
its length. For any assignment $t \in D^*$ and for any integer $j \le |t|$, $t[j] \in D$
denotes the $j$th component of $t$. A relation of rank $k$ (or $k$-ary relation) over $D$
is a subset of $D^k$.

We use the term formula in a wide sense, to mean any well-formed formula, 
formed from variables, constants, logical connectives, parentheses, relation sym- 
bols, and existential and universal quantifiers. 

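Let $S = \{R_1, R_2, \ldots\}$ be any set where each $R_i$ is a relation of rank $k_i$. $R_i$ denotes
both the relation and its symbol. The set of quantified formulas with constants
over the basis $S$, denoted by $\exists\forall\text{-Form}_c(S)$, is the smallest set of formulas such
that:

a. - For all $R \in S$ of rank $k$, $R(y_1, \ldots, y_k) \in \exists\forall\text{-Form}_c(S)$ where $y_i \in V \cup D$ for
$1 \le i \le k$.

b. - For all $F, G \in \exists\forall\text{-Form}_c(S)$, $F \wedge G \in \exists\forall\text{-Form}_c(S)$.

c. - For all $F \in \exists\forall\text{-Form}_c(S)$ and for all $x \in V$, $\exists x F \in \exists\forall\text{-Form}_c(S)$.

d. - For all $F \in \exists\forall\text{-Form}_c(S)$ and for all $x \in V$, $\forall x F \in \exists\forall\text{-Form}_c(S)$.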

If we remove condition (d) in the previous definition we obtain a reduced class
of formulas called existentially quantified formulas with constants over the basis
$S$, denoted by $\exists\text{-Form}_c(S)$. Furthermore, if we also remove condition (c) we
obtain a more reduced class called formulas with constants over the basis $S$ and
denoted by $\text{Form}_c(S)$.

If in the previous definitions we replace $y_i \in V \cup D$ by $y_i \in V$ in (a), we
obtain the constant-free counterparts of the previous sets of formulas, denoted by
$\exists\forall\text{-Form}(S)$, $\exists\text{-Form}(S)$, and $\text{Form}(S)$ respectively.

Each formula $F$ defines a relation $[F]$ if we apply the usual semantics of first-order
logic and the variables are taken in lexicographical order. This operator can
be extended to sets of formulas: for every set of relations $S$ we define $\text{Rel}(S) =
\{[F] : F \in \text{Form}(S)\}$. Similarly, we also define $\exists\forall\text{-Rel}(S)$, $\exists\text{-Rel}(S)$, $\text{Rel}_c(S)$,
$\exists\forall\text{-Rel}_c(S)$, and $\exists\text{-Rel}_c(S)$.



3 Generating Set Algorithm 

In this paper we consider two models of learning, both of which are fairly stan- 
dard: Angluin’s model of exact learning from queries P] and the model of PAC- 
prediction with membership queries as defined by Angluin and Kharitonov |2j. 
We also assume some familiarity with the prediction with membership reduction 
(see 0 for details). 

Most of the terminology about learning comes from 0. Strings over $X = D^*$
will represent both examples and concept names. A representation of concepts $C$
is any subset of $X \times X$. We interpret an element $(u, x)$ of $X \times X$ as consisting of
a concept name $u$ and an example $x$. The example $x$ is a member of the concept
$u$ if and only if $(u, x) \in C$. Define the concept represented by $u$ as $K_C(u) = \{x :
(u, x) \in C\}$. The set of concepts represented by $C$ is $K_C = \{K_C(u) : u \in X\}$.

We will focus on representation classes for formulas. These representation classes
have a stratified structure, that is, every concept contains examples of the same
size. For any stratified representation class $C$ we define $K_{C,n}$ as the set of concepts
in $C$ with examples of length $n$. We say that a stratified representation class $C$
is intersection-closed if:

$$\forall c_1, c_2 \in K_{C,n} \;\Rightarrow\; \exists c_3 \in K_{C,n} : K_C(c_3) = K_C(c_1) \cap K_C(c_2)$$

When $C$ is intersection-closed, then, for any set of examples $H$ of the same length,
we can consider the intersection of all the concepts in $C$ containing $H$, denoted
by $\langle H \rangle_C$. We will say that $H$ is a generating set of the concept $\langle H \rangle_C$.
Notice that closure under intersection depends only on the set of concepts $K_C$
and does not depend at all on the particular representation class $C$. In fact, we
can consider generating sets as an alternative representation class for the same
collection of concepts. More formally, given a set of examples $H$, we say that an
example $x$ belongs to the class represented by $H$ iff $x \in \langle H \rangle_C$.
A representation class has to be polynomial-time evaluable, that is, it has to be 
decidable in polynomial time if a vector belongs to the class. We will focus on 
representation classes fulfilling this condition. For the remainder of this section 
we will assume that we are dealing with polynomial-time evaluable representa- 
tion classes but we will have to take into consideration this property later when 
we present concrete examples. 

This representation class suggests an immediate learning algorithm using equiva- 
lence queries: Start with an empty set of generators and keep asking equivalence 
queries and adding vectors until the set is complete. We call this algorithm the 
Generating Set (GS) Algorithm and we state it for future reference. 



Algorithm GS

$H \leftarrow \emptyset$,
while EQ($\langle H \rangle_C$) = 'no' do
    Let $c$ be the counterexample,
    $H \leftarrow H \cup \{c\}$,
end while,
return $\langle H \rangle_C$.
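
The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the equivalence oracle is simulated by comparing the hypothesis with a known finite target, and the closure operator `affine_closure` (closure of Boolean tuples under the coordinate-wise operation x xor y xor z) is an assumed example of an intersection-closed class. The names `gs_learn`, `closure_of` and `affine_closure` are hypothetical.

```python
from itertools import product


def affine_closure(H):
    """Smallest superset of H closed under coordinate-wise x ^ y ^ z (assumed example class)."""
    closed = set(H)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so that `closed` can grow inside the loop.
        for a, b, c in product(list(closed), repeat=3):
            t = tuple(x ^ y ^ z for x, y, z in zip(a, b, c))
            if t not in closed:
                closed.add(t)
                changed = True
    return closed


def gs_learn(target, closure_of):
    """Generating Set loop with a simulated equivalence oracle.

    `target` must itself be closed, so every hypothesis is a subset of it
    and every counterexample is a positive example.
    """
    H = set()
    queries = 0
    while True:
        hypothesis = closure_of(H)
        queries += 1                      # one equivalence query per hypothesis
        counterexamples = target - hypothesis
        if not counterexamples:           # oracle answers 'yes'
            return H, queries
        H.add(min(counterexamples))       # oracle answers 'no' with a counterexample


# Target: the even-weight vectors of {0,1}^3, an affine subspace.
target = {(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)}
H, queries = gs_learn(target, affine_closure)
```

In this run the generating set ends up smaller than the target concept, and the number of queries is always one more than the number of generators collected.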

Algorithm GS can be applied to any intersection-closed representation class $C$.
There is a canonical algorithm for learning intersection-closed classes in the PAC
model m called the Closure Algorithm: the output of this algorithm is always
$\langle H \rangle_C$, where $H$ contains all the positive samples. Algorithm GS is a direct
adaptation of the Closure Algorithm. Consequently, it is possible to extend the
results in this paper to nested differences of intersection-closed classes using an
approach similar to m-

It is possible to convert this algorithm to a proper algorithm by finding, for every
set of generators $H$, a concept $c$ equivalent to $\langle H \rangle_C$, but this step can be
computationally expensive.
Clearly, algorithm GS always finds the target concept; the only drawback is
its time complexity, since it can require an exponential number of equivalence
queries. In consequence, we are interested in characterizing the cases in which
algorithm GS learns a representation class with a polynomial number of queries.
First, we need the following definition:

Definition 1. Let $C$ be an intersection-closed representation class of concepts,
and let $u = (x_0, x_1, \ldots, x_m)$ be a list of vectors in $D^n$. If for every $1 \le i \le m$,
$x_i \notin \langle \{x_0, \ldots, x_{i-1}\} \rangle_C$ then we call $u$ an additive sequence over $C^n$.
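
To make Definition 1 concrete, here is a small sketch that tests whether a list of vectors is an additive sequence. The closure used for the demonstration (the axis-parallel bounding box of a set of integer points) is an assumption chosen only because it is a simple intersection-closed class; it is not taken from the paper, and the names `is_additive` and `in_bounding_box` are hypothetical.

```python
def in_bounding_box(x, points):
    """Membership of x in the smallest axis-parallel box containing `points`.

    Boxes are an assumed example of an intersection-closed class.
    The closure of the empty set is empty.
    """
    if not points:
        return False
    return all(
        min(p[i] for p in points) <= x[i] <= max(p[i] for p in points)
        for i in range(len(x))
    )


def is_additive(seq, member):
    """True iff every x_i (i >= 1) lies outside the closure of {x_0, ..., x_{i-1}}."""
    return all(not member(seq[i], seq[:i]) for i in range(1, len(seq)))
```

For example, `[(0, 0), (2, 0), (1, 3)]` is additive for boxes, while `[(0, 0), (2, 2), (1, 1)]` is not, because the last point already lies in the box spanned by the first two.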

Clearly, an additive sequence is just a reformulation of a possible sequence
of counterexamples provided by algorithm GS. A representation class is
learnable using algorithm GS if the size of every possible sequence of counterexamples
can be bounded by a polynomial in the size of the examples and the
size of the representation class. We do not consider the size of the representation
class. This choice makes the analysis simpler, since we obtain independence from the
particular representation class. Furthermore, as we will see later (Section l^), there
is some evidence that this restriction does not make any difference when we are
dealing with quantified formulas. Formally, we say that a representation class $C$
is polynomially bounded if it is intersection-closed and every additive sequence
over $C^n$ has size polynomial in $n$.

Theorem 1. Let $C$ be a polynomially bounded representation class. Then $C$ is
polynomially learnable with improper equivalence queries using algorithm GS.
Furthermore, $C$ is also learnable with a polynomial number of proper equivalence
queries (not necessarily in polynomial time).

Finally, notice that every intersection-closed representation class contained in a 
polynomially bounded representation class is also polynomially bounded. 

4 Learning Subuniverses 

The following definitions are fairly standard (see |1!H for example). An algebra
is an ordered pair $(D, \Phi)$ such that $D$ is a nonempty set, called the universe,
and $\Phi$ is a set of finite operations^ over $D$. There are some standard ways to
assemble new algebras from those already at hand. The chief tools we will use
are the formation of subuniverses and the formation of direct powers.

Let $\varphi$ be an $m$-ary operation over $D$, and let $E$ be a subset of $D$. We say
that $E$ is closed under $\varphi$ (or $\varphi$ preserves $E$) if and only if $\forall d_1, \ldots, d_m \in
E$, $\varphi(d_1, \ldots, d_m) \in E$.

Let $(D, \Phi)$ be an algebra. A subset $E$ of $D$ closed under every operation in $\Phi$
is called a subuniverse of $(D, \Phi)$. Let $(D, \Phi)$ be an algebra and $n$ any positive
integer. The direct power of $(D, \Phi)$, denoted $(D, \Phi)^n$, is the algebra $(D^n, \{\varphi^n : \varphi \in
\Phi\})$, where for every $m$-ary operation $\varphi$, $\varphi^n$ is the function given by

$$\varphi^n(x_1, \ldots, x_m) = (\varphi(x_1[1], \ldots, x_m[1]), \ldots, \varphi(x_1[n], \ldots, x_m[n]))$$

^ Technically, $\Phi$ is an indexed system of operations rather than a mere set of operations.
However, we consider that presenting it as a set of operations is simpler for the
purposes of this paper.

Direct power is the natural way to extend the algebra to tuples of elements.
From now on, due to the correspondence between an algebra and its direct
powers, we will abuse notation and, for instance, say that an $n$-ary
relation $R$ over $D$ is closed under an operation $\varphi : D^m \to D$, meaning that $R$
is closed under $\varphi^n$. Furthermore, we will say that a formula $F$ over $D$ is closed
under $\varphi$ if and only if $[F]$ is closed under $\varphi$. Moreover, we will say that a set
of relations (formulas) is closed under an operation $\varphi$ if every relation (formula)
in the set is closed under $\varphi$. Let $\varphi$ be an operation over $D$. Operation $\varphi$ is said
to be exhaustive if its image is $D$. Operation $\varphi$ is said to be idempotent if it
satisfies $\varphi(x, x, \ldots, x) = x$. Idempotent operations are exhaustive. For any set
of operations $\Phi$, we define $\text{Idem}(\Phi)$ as the subset containing exactly the idempotent
operations in $\Phi$.
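
The three notions just introduced (closure of a relation under an operation, exhaustiveness, idempotence) are easy to test by brute force on a small finite domain. The sketch below is illustrative only; the function names and the example relation (the order relation on {0, 1, 2}) are assumptions, not part of the paper.

```python
from itertools import product


def is_idempotent(op, arity, domain):
    """op(x, x, ..., x) == x for every x in the domain."""
    return all(op(*([x] * arity)) == x for x in domain)


def is_exhaustive(op, arity, domain):
    """The image of op is the whole domain."""
    image = {op(*args) for args in product(domain, repeat=arity)}
    return image == set(domain)


def preserves(op, arity, relation):
    """The relation (a set of equal-length tuples) is closed under op applied coordinate-wise."""
    return all(
        tuple(op(*column) for column in zip(*rows)) in relation
        for rows in product(relation, repeat=arity)
    )


D = [0, 1, 2]
leq = {(a, b) for a in D for b in D if a <= b}  # assumed example relation
```

With these definitions, `max` is idempotent and exhaustive on `D` and preserves `leq`, whereas subtraction modulo 3 does not preserve it.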

It is straightforward to see from the definition that the property of being closed
under some operation is preserved by some of the classes of relations described
in Section 2.

Lemma 1. Let $S$ be a set of logical relations closed under some operation $\varphi$ over
$D$. Then, $\exists\text{-Rel}(S)$ is also closed under $\varphi$. Furthermore, if $\varphi$ is exhaustive then
$\exists\forall\text{-Rel}(S)$ is closed under $\varphi$. Furthermore, if $\varphi$ is idempotent then $\exists\forall\text{-Rel}_c(S)$
is closed under $\varphi$.

The intersection of any collection of subuniverses of an algebra $(D, \Phi)$ is again a
subuniverse. In consequence, we can define generating sets for subuniverses in an
analogous way to the previous section. This applies as well to direct powers. More
precisely, let $H$ be a subset of $D^n$; the intersection $\langle H \rangle_{(D,\Phi)}$ of all subuniverses
of the direct power $(D, \Phi)^n$ containing $H$ will be called the subuniverse generated
by $H$.

Consequently, a reasonable way to represent subuniverses is using generating
sets. We call $C_{(D,\Phi)}$ the class of subuniverses of some direct power $(D, \Phi)^n$
represented by sets of generators. Since algebras define representation classes,
concepts introduced in the previous section are applicable to algebras. So, we will
say that an algebra $(D, \Phi)$ is polynomially evaluable or polynomially bounded if
so is $C_{(D,\Phi)}$.

We can turn these results in algebras into results about quantified formulas by 
considering closure operations. The next theorem summarizes the results. 

Theorem 2. Let $\mathcal{F}$ be a class of formulas closed under a polynomially bounded
algebra $(D, \Phi)$. Then the following conditions hold:

(a) $\mathcal{F}$, $\text{Form}([\mathcal{F}])$, and $\exists\text{-Form}([\mathcal{F}])$ are polynomially learnable with improper
equivalence queries,

(b) $\text{Form}([\mathcal{F}])$ and $\exists\text{-Form}([\mathcal{F}])$ are learnable with a polynomial number of proper
equivalence queries (not necessarily in polynomial time).

(c) If every operation in $\Phi$ is exhaustive, then conditions (a) and (b) are also
satisfied by $\exists\forall\text{-Form}([\mathcal{F}])$.

(d) If every operation in $\Phi$ is idempotent, then conditions (a) and (b) are also
satisfied by $\text{Form}_c([\mathcal{F}])$, $\exists\text{-Form}_c([\mathcal{F}])$, and $\exists\forall\text{-Form}_c([\mathcal{F}])$.

Proof. Condition (a) is a direct consequence of the definition of polynomially
bounded algebra, Lemma 1, and the first part of Theorem 1. To show condition
(b) just consider that both classes of concepts are intersection-closed and apply
the second part of Theorem 1. Condition (c) follows from the fact that closure
under an exhaustive operation is preserved by universal quantification. Finally,
condition (d) is due to the fact that closure under an idempotent operation is
preserved by constantification, and that idempotence implies exhaustivity. ■

In the remainder of this section we will introduce two families of operations, 
namely, coset generating operations and near-unanimity operations, such that 
the class of subuniverses of algebras containing one of these operations is poly- 
nomially bounded. In consequence, every class of quantified formulas with a 
basis preserved by some function in one of these families is polynomially learn- 
able using algorithm GS. We just define these families and present the positive 
learnability results. 



4.1 Coset Generating Operations 

Definition 2. An operation $\varphi : D^3 \to D$ is called a 'coset generating (CG)
operation' if for all $x, y, z, u \in D$,

1. $\varphi(x, x, y) = \varphi(y, x, x) = y$

2. $\varphi(\varphi(x, y, z), z, u) = \varphi(x, y, u)$

3. $\varphi(u, z, \varphi(z, y, x)) = \varphi(u, y, x)$

It is easy to adapt the proof of Proposition 2.2 in m to show that the previous
definition is equivalent to the existence of some group $(D, \cdot)$ such that $\varphi(x, y, z) =
x \cdot y^{-1} \cdot z$.
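
One direction of this equivalence can be checked mechanically on a small example. The sketch below verifies the three identities of Definition 2 for the operation x · y⁻¹ · z over the symmetric group on three elements, a non-abelian group chosen here as an assumption for illustration. The helper names (`compose`, `inverse`, `phi`, `is_coset_generating`) are hypothetical.

```python
from itertools import permutations, product


def compose(p, q):
    """Product p*q of permutations stored as tuples: (p*q)(i) = p[q[i]]."""
    return tuple(p[j] for j in q)


def inverse(p):
    """Inverse permutation: inverse(p)[p[i]] == i."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def phi(x, y, z):
    """The ternary operation x * y^{-1} * z induced by the group."""
    return compose(compose(x, inverse(y)), z)


def is_coset_generating(op, domain):
    """Brute-force check of the three identities of Definition 2."""
    for x, y, z, u in product(domain, repeat=4):
        if op(x, x, y) != y or op(y, x, x) != y:
            return False
        if op(op(x, y, z), z, u) != op(x, y, u):
            return False
        if op(u, z, op(z, y, x)) != op(u, y, x):
            return False
    return True


S3 = list(permutations(range(3)))  # the six permutations of {0, 1, 2}
```

Calling `is_coset_generating(phi, S3)` returns `True`, while a projection such as `lambda x, y, z: x` fails the first identity.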

If $(D, \cdot)$ is abelian, we have a particular kind of CG operation called an affine
operation. Affine operations have been intensively studied in universal algebra
(see PS] for example). It is well known (see [13] for example) that for every
affine operation over a finite set $D$ of prime size, the subuniverses of its direct
power are exactly the subsets of $D^n$ that can be expressed as a system of counting
functions modulo $|D|$. The learnability of these formulas has already been shown
in 0 using a similar strategy. The next result generalizes this result to arbitrary
CG operations.

Theorem 3. Let $\varphi$ be a coset generating operation, and let $\mathcal{F}$ be a set of formulas
closed under $\varphi$; then $C_{(D,\varphi)}$ is polynomial-time evaluable and polynomially
learnable with proper equivalence queries. Furthermore, the class $\exists\forall\text{-Form}_c([\mathcal{F}])$
is polynomially learnable with equivalence queries in $C_{(D,\varphi)}$.
Proof. The proof contains mainly two results. We have to show that algebras
containing CG operations are polynomially bounded and polynomially evaluable.
First we prove that $(D, \varphi)$ is polynomially bounded:

Let $u = \{x_0, \ldots, x_m\}$ be any additive sequence over $(D, \varphi)^n$. By Lemma 3 in
Section 5.7 in d, $\langle \{x_0, \ldots, x_m\} \rangle_{(D,\varphi)}$ is a right coset of a subgroup $J$ of the
product group $(D, \cdot)^n$, so we can take $x_0$ as a representative of the coset and
consider the set $H = \{x_0 \cdot x_0^{-1}, x_1 \cdot x_0^{-1}, \ldots, x_m \cdot x_0^{-1}\}$ as a generating set for the
subgroup $J$. From the fact that $\{x_0, \ldots, x_m\}$ is an additive sequence we have
that $H$ is independent in the sense that no element in $H$ can be generated from
the remainder. Therefore, the cardinality of the subgroup is at least $2^m \le |D|^n$,
which gives a polynomial bound for the size of the basis $m$.

The proof of the polynomial-time evaluability mimics the proof of Theorem 32
in PI but it is not entirely straightforward.

Let $H = \{x_0, \ldots, x_m\}$ be any set of vectors over $D^n$. Let $(D, \cdot)$ be the group
associated with $\varphi$. As has been pointed out in the previous proof, $\langle H \rangle_{(D,\varphi)}$ is
a right coset of a subgroup $J$ of the product group $(D, \cdot)^n$. We can take $x_0$
as a representative of the coset and it is not hard to prove that $\{x_0 \cdot x_0^{-1}, x_1 \cdot
x_0^{-1}, \ldots, x_m \cdot x_0^{-1}\}$ is a generating set of the subgroup $J$. So, the problem is
reduced to the problem of deciding whether a vector $y_0$ belongs to a group
represented by a set of generators.

Consider the tower of subgroups $G_i$, $0 \le i \le n$, where $G_i$ is the subgroup of $J$
obtained fixing the first $i$ components to 0. A right coset representation of this
tower can be efficiently constructed using algorithm 7 in [II llpl
Now we will present a polynomial-time algorithm that, given a right coset
representation for the tower of subgroups, decides whether $y_0 \in G_0$, proving the
polynomial-time evaluability of CG operations.

Let $a_1, \ldots, a_r$ be the right coset representatives of $G_0$ in $G_1$. Let $T = \{a_i : 1 \le
i \le r,\ a_i[1] = y_0[1]\}$ be the subset of the representatives coinciding with $y_0$ in the
first component. It is clear that if $|T| = 0$ then $y_0 \notin G_0$. On the other hand, we
have $|T| \le 1$; otherwise, let $a_i, a_j$ be two different elements in $T$, then $(a_j \cdot a_i^{-1})$
has a 0 as first component and, in consequence, belongs to $G_1$ (incidentally, this
fact proves that the size of the coset representation is not too large).

Therefore, let $a_i$ be the unique representative in $T$. Since $y_0 \in G_0$ iff $y_0 \cdot a_i^{-1} \in
G_1$, now we proceed with $y_1 = y_0 \cdot a_i^{-1}$ and $G_1$ as we did before for $y_0$ and $G_0$
and so on. If during some step $j$ of this process we find that there does not exist
any coset representative of $G_j$ in $G_{j+1}$ coinciding with $y_j$ in the $(j+1)$th component
then we know that the answer is no. Otherwise, after $n$ repetitions the answer
is yes. ■

^ It is convenient to notice that algorithm 7 requires a polynomial procedure for
testing membership in $G_i$, a condition that in our case is not satisfied, since we are
precisely interested in this procedure for $G_0$. However, this condition can be relaxed
noticing that all the elements generated by algorithm 7 already belong to $G_0$. In this
case, membership in $G_i$ is straightforward: just check that the first $i$ components are
equal to 0.



4.2 Near-unanimity Operations 

Definition 3. An operation $\varphi : D^r \to D$ is called a 'near-unanimity (NU)
operation' if for all $x, y \in D$, $\varphi(x, y, y, \ldots, y) = \varphi(y, x, y, \ldots, y) = \cdots =
\varphi(y, y, \ldots, y, x) = y$.

A near-unanimity operation of rank 3 is also called majority operation. For near- 
unanimity operations it is even possible to prove proper learnability. 
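
The near-unanimity condition can be tested by brute force on a finite domain. The following sketch is illustrative only: the checker `is_near_unanimity` is a hypothetical helper, and the two example operations (Boolean majority and the dual discriminator on a three-element domain) are assumed examples of majority operations rather than operations taken from the paper.

```python
from itertools import product


def is_near_unanimity(op, arity, domain):
    """Check op(y, ..., x, ..., y) == y with the lone x in every position."""
    for x, y in product(domain, repeat=2):
        for position in range(arity):
            args = [y] * arity
            args[position] = x
            if op(*args) != y:
                return False
    return True


def majority(x, y, z):
    """Boolean majority of three bits."""
    return 1 if x + y + z >= 2 else 0


def dual_discriminator(x, y, z):
    """Returns x when the first two arguments agree, otherwise z."""
    return x if x == y else z
```

Both example operations pass the check, whereas the affine operation `x ^ y ^ z` on bits does not, since it returns the lone dissenting value.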

Theorem 4. Let $\varphi$ be a near-unanimity operation, and let $\mathcal{F}$ be a set of formulas
closed under $\varphi$; then $C_{(D,\varphi)}$ is polynomial-time evaluable and polynomially
learnable with proper equivalence queries. Furthermore, the class $\exists\forall\text{-Form}_c([\mathcal{F}])$
is polynomially learnable with proper equivalence queries.

To prove this result we have to introduce some notation. Let $R$ be an $n$-ary
relation over $D$ and let $I = (i_1, \ldots, i_k)$ be a list of indices chosen from $\{1, \ldots, n\}$.
The projection $\pi_I(R)$ is defined to be the $k$-ary relation

$$\pi_I(R) = \{(t[i_1], \ldots, t[i_k]) \mid t \in R\}$$

The projection of a tuple $t$, $\pi_I(t)$, is defined similarly.

Definition 4. An $n$-ary relation $R$ over $D$ is said to be $r$-decomposable if it
contains all $n$-tuples $t$ such that $\pi_I(t) \in \pi_I(R)$ for all lists of indices $I$ from the
set $\{1, 2, \ldots, n\}$ with $|I| \le r$.

From ^ we have this useful property. 

Theorem 5. Every relation R over D, closed under a near-unanimity operation 
of arity r is {r — 1) -decomposable. 

Using this property, it is an easy task to prove that near-unanimity functions 
are polynomially bounded. 

Lemma 2. Let $\varphi$ be a NU operation over a finite domain $D$. Then $(D, \varphi)$ is
polynomially bounded.

Proof. Let $r$ be the arity of $\varphi$ and let $u = \{x_0, \ldots, x_m\}$ be any additive
sequence over $(D, \varphi)^n$. For every $1 \le i \le m - 1$, the set $H_i = \langle \{x_0, \ldots, x_i\} \rangle_{(D,\varphi)}$
is $(r-1)$-decomposable. We know that $x_{i+1} \notin H_i$, so there exists some set
of indices $I$, with $|I| \le r - 1$, such that $\pi_I(x_{i+1}) \notin \pi_I(H_i)$. This implies that
$\pi_I(x_{i+1}) \ne \pi_I(x_j)$ for every $0 \le j \le i$. The result follows from the fact that
there are only a polynomial number of choices for $I$ and $\pi_I(x_{i+1})$. ■

The intuition underlying this result is the following: every relation closed under a 
near-unanimity operation of arity r is decomposable as a conjunction of relations 






Victor Dalmau and Peter Jeavons 



of fixed arity r — 1, therefore the problem of learning this class of relations is 
reduced to the problem of learning conjunctions of clauses of a fixed arity that 
can be solved using a similar approach to the one that is known for learning 
(r — 1)-CNF p. An empty basis can be regarded as a conjunction containing a 
full (r — l)-ary relation for every possible set with at most r — 1 indices. Every 
time we add a tuple to the basis, we remove in every relation the tuples falsified 
by the new tuple. We have a polynomial number of such tuples and with every 
addition to the basis we remove at least one. 

Finally, we prove the polynomial-time evaluability of near-unanimity operations. 

Lemma 3. Let $\varphi$ be a NU operation over a finite domain $D$. Then $(D, \varphi)$ is
polynomially evaluable.

Proof. Let $r$ be the arity of $\varphi$, let $H$ be a set of tuples over $D^n$ and let $x$ be
a tuple over $D^n$. It is easy to see that $x \in \langle H \rangle_{(D,\varphi)}$ iff for every list of indices
$I$ over $\{1, 2, \ldots, n\}$ with $|I| \le r - 1$, there exists some tuple $u$ in $H$ such that
$\pi_I(u) = \pi_I(x)$. Clearly, this condition can be checked in polynomial time. ■
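
The projection test from this proof can be illustrated on the Boolean domain with the majority operation (arity r = 3), where it can be compared against a brute-force computation of the generated subuniverse. This is a sketch under those assumptions only; it does not attempt to cover other domains or operations, and the function names are hypothetical. Since a match on a larger index set implies a match on every subset of it, the sketch only checks index sets of size exactly min(r − 1, n).

```python
from itertools import combinations, product


def majority(x, y, z):
    """Boolean majority of three bits (a near-unanimity operation of arity 3)."""
    return 1 if x + y + z >= 2 else 0


def generated_subuniverse(H):
    """Brute force: smallest superset of H closed under coordinate-wise majority."""
    closed = set(H)
    changed = True
    while changed:
        changed = False
        for a, b, c in product(list(closed), repeat=3):
            t = tuple(majority(x, y, z) for x, y, z in zip(a, b, c))
            if t not in closed:
                closed.add(t)
                changed = True
    return closed


def project(t, I):
    """Projection of tuple t onto the index list I."""
    return tuple(t[i] for i in I)


def member_by_projections(x, H, r=3):
    """Projection criterion: every (r-1)-index projection of x is matched by some u in H."""
    n = len(x)
    size = min(r - 1, n)
    return all(
        any(project(u, I) == project(x, I) for u in H)
        for I in combinations(range(n), size)
    )


H = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
brute = generated_subuniverse(H)
```

On this example the two membership tests agree on every Boolean triple, and the only tuple generated beyond `H` itself is `(0, 0, 0)`.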

Finally we have to prove that it is possible to get proper learnability for quantified
formulas. In fact, we establish a general condition that guarantees proper
learnability in some more cases.

Lemma 4. Let $\mathcal{F}$ be a class of formulas closed under a near-unanimity operation $\varphi$ of arity $r$ and closed under existential quantification, conjunction and renaming of variables. Then $\mathcal{F}$ is polynomially learnable with proper equivalence queries.

Proof. We only need to prove that for every set $H$ of models over $D^n$, it is possible to find a formula equivalent to $R = \langle H \rangle_{\varphi}$ in polynomial time. Consider, for every set $I = \{i_1, \ldots, i_{r-1}\}$ of $r-1$ indices, the relation $\pi_I(R)$. Clearly, this relation is obtained by some formula in $\mathcal{F}$ (the set of relations of rank $r-1$ in $[\mathcal{F}]$ is finite and fixed, and therefore a list of formulas generating all of them can be precalculated). By the $(r-1)$-decomposability of $R$ the following formula is equivalent to $R$:

$$\bigwedge_{I = \{i_1, \ldots, i_{r-1}\}} \pi_I(R)(x_{i_1}, \ldots, x_{i_{r-1}}).$$



Summarizing, 

Corollary 1. Let $\mathcal{F}$ be a class of formulas closed under a near-unanimity operation $\varphi$. Then the following classes of formulas are polynomial-time learnable with proper equivalence queries: $\exists$-Form$([\mathcal{F}])$, $\exists\forall$-Form$([\mathcal{F}])$, $\exists$-Form$_c([\mathcal{F}])$, and $\exists\forall$-Form$_c([\mathcal{F}])$.




5 Non-learnability results

In previous sections we have used algebraic properties of relations to prove learn- 
ability results. More precisely, we used closure operations to show that algorithm 
GS can learn efficiently some sets of formulas. In fact, the link between the com- 
plexity of learning some classes of quantified formulas and closure operations is 
even tighter: closure operations of quantified formulas determine the learnability 
complexity. 

Definition 5. Let $S$ be a set of relations over $D$. Define $\mathrm{Pol}(S)$ (polymorphisms of $S$) to be the set of all operations $\varphi$ on $D$ such that every relation in $S$ is closed under $\varphi$.
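
To make Definition 5 concrete, the following sketch tests whether a single operation is a polymorphism of a single relation by applying it coordinate-wise to every choice of tuples. The function name `preserves` and the encoding of relations as sets of tuples are assumptions made for this illustration only.

```python
from itertools import product

def preserves(op, arity, relation):
    """Return True iff `relation` is closed under the operation `op`,
    i.e. applying op coordinate-wise to any `arity` tuples of the
    relation yields a tuple of the relation again."""
    rel = set(relation)
    for rows in product(rel, repeat=arity):
        image = tuple(op(*column) for column in zip(*rows))
        if image not in rel:
            return False
    return True

def majority(x, y, z):
    return int(x + y + z >= 2)

def conjunction(x, y):
    return x & y

# The boolean relation "x or y", written as its set of satisfying tuples.
x_or_y = {(0, 1), (1, 0), (1, 1)}
print(preserves(majority, 3, x_or_y))
print(preserves(conjunction, 2, x_or_y))
```

The check is exponential in the arity of the operation, which is harmless for the small fixed arities used here.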

The polymorphisms of a set have a known structure: (a) they contain all the projections, i.e., functions that return one of their arguments, and (b) they are closed under composition. Any set of operations satisfying these conditions is called a clone [?]. The set of all clones over some finite set $D$ forms a lattice. The set of polymorphisms of a class of formulas determines its learning complexity.

Theorem 6. Let $S$ and $S_0$ be sets of relations over a finite set $D$.

1. If $\mathrm{Pol}(S) \subseteq \mathrm{Pol}(S_0)$ and $\exists$-Form$(S)$ is polynomially predictable with membership queries, so is $\exists$-Form$(S_0)$.

2. If $\mathrm{Idem}(\mathrm{Pol}(S)) \subseteq \mathrm{Idem}(\mathrm{Pol}(S_0))$ and $\exists$-Form$_c(S)$ is polynomially predictable with membership queries, so is $\exists$-Form$_c(S_0)$.

Therefore, from now on it is justified to talk about the complexity of learning an algebra $(D, \Phi)$, meaning the set of quantified formulas closed under all the operations in the algebra. That allows us to use the body of knowledge from universal algebra about clones. Theorem 6 also holds for some other learning models, for example exact learning with equivalence queries, PAC-learning and PAC-prediction, and the models obtained by adding membership queries to them. Since in this section we are interested in non-learnability results, we only state the theorem for the strongest model: PAC-prediction with membership queries.

The original aim of this paper was to extend the dichotomic classification for the learnability of boolean formulas [7] to larger domains. Unfortunately, the 2-element domain is rather particular: whereas the clone lattice for the boolean domain is countable and was fully identified by Post [?], the clone lattice for larger domains is more involved. In fact, it is known [?] that the clone lattice for $|D| \ge 3$ contains uncountably many clones. This fact seems to indicate that it will be hard to find a complete classification for larger domains.

However, in this section we will show that by restricting the lattice of clones in some ways, i.e., considering only minimal clones and only plain clones, we obtain a dichotomy. This fact provides some evidence that the known learnable families, i.e., CG and NU operations, are complete. We will consider clones containing only idempotent operations. This assumption is justified by the fact that clones containing only idempotent operations correspond to the closure operations of classes of formulas containing constants (Theorem 6).

Atoms in the clone lattice are called minimal clones. Learnability of minimal 
clones is completely classified. 

Theorem 7. (Dichotomy Theorem for minimal clones) Let $S$ be a set of relations such that $\mathrm{Pol}(S) = (D, \varphi)$ is minimal. Then, if $\varphi$ is a majority operation or an affine operation, $\exists\forall$-Form$_c(S)$ is polynomially learnable with improper equivalence queries; otherwise $\exists$-Form$_c(S)$ is not polynomially predictable even with membership queries under cryptographic assumptions.

An algebra is called plain iff it is simple and has no nonsingleton proper subal- 
gebras. Plain algebras also have a complete classification. 

Theorem 8. (Dichotomy Theorem for plain clones) Let $S$ be a finite set of relations such that $\mathrm{Pol}(S)$ is plain. Then, if $\mathrm{Pol}(S)$ contains a near-unanimity operation or an affine operation, $\exists\forall$-Form$_c(S)$ is polynomially learnable with improper equivalence queries; otherwise $\exists$-Form$_c(S)$ is not polynomially predictable even with membership queries under cryptographic assumptions.



6 Boolean Case revisited 

In this section we review the already known results about the learnability of quantified boolean formulas [7] from the perspective provided by the connection with clone theory, in order to simplify the proof and strengthen the results.

We start by introducing the dichotomy theorem for the learnability of quantified boolean formulas with constants. The following definitions are from [7], but the notation has been slightly adapted for convenience:

A boolean relation is bijunctive if it can be expressed as a CNF where every clause has at most 2 literals. A boolean relation is said to be k-weakly monotone (resp. k-weakly antimonotone) ($k \ge 3$) if it can be expressed as a CNF where every clause is either (i) the disjunction of at most k unnegated variables (resp. negated variables) or (ii) the disjunction of at most two literals with at most one negated (resp. unnegated) variable. A boolean relation is weakly monotone (WM) (resp. weakly antimonotone (WA)) if it is k-ary weakly monotone (resp. k-ary weakly antimonotone) for some $k \ge 3$. Finally, a boolean relation R is linear if it is logically equivalent to some system of equations over GF(2).

These definitions are extended in a natural way to sets of relations. We say, for example, that a set of relations S is bijunctive iff every relation in S is bijunctive. In [7] it is shown that, under certain assumptions, quantified formulas are learnable iff the basis is bijunctive, weakly monotone, weakly antimonotone or linear.

Figure 1 represents the idempotent Post lattice, that is, the lattice of clones on a 2-element set restricted to idempotent operations. Nodes in Figure 1 are labeled according to Post’s notation (see [?] for a simple description of the lattice). Grey nodes denote clones generated by an infinite set of relations.

Fig. 1. Post Lattice restricted to idempotent operations



Black nodes denote clones containing a near-unanimity or an affine function. 
The following table establishes the correspondence between these clones and the 
learnable classes in the boolean case. 



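Class of bases    Operation

bijunctive        $\varphi(x, y, z) = xy \lor yz \lor zx$

k-ary WM          $\varphi(x_1, \ldots, x_{k+1}) = \bigwedge_{i=1}^{k+1} (x_1 \lor \cdots \lor x_{i-1} \lor x_{i+1} \lor \cdots \lor x_{k+1})$

k-ary WA          $\varphi_{F^k}(x_1, \ldots, x_{k+1}) = \bigvee_{i=1}^{k+1} (x_1 \land \cdots \land x_{i-1} \land x_{i+1} \land \cdots \land x_{k+1})$

linear            $\varphi_{L_4}(x, y, z) = x \oplus y \oplus z$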



Every entry in the table, consisting of a class of bases $C$ and an operation $\varphi$, has to be interpreted in the following way: a basis $S$ belongs to the class $C$ iff $\varphi \in \mathrm{Pol}(S)$. On the other hand, operation $\varphi_X$ generates clone $X$.

Clearly, $\varphi_{L_4}$ is an affine operation, and the other three operations in the table are near-unanimity
operations. So, the learnability of these classes follows directly from Theorems 0 
and0 As the lattice is ordered from bottom to top according to inclusion, the 
remainder of the black nodes contain also a near-unanimity or an affine operation 
inherited from some ancestor. 
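
The near-unanimity property claimed for the operations associated with the weakly monotone and weakly antimonotone classes can be checked by brute force on the boolean domain. The sketch below is an illustration only: the function names are hypothetical, and the two operations are written as "conjunction of the disjunctions that omit one argument" and its dual, which is an assumed reading of the table entries.

```python
from itertools import product

def wm_operation(*xs):
    """AND over i of the OR of all arguments except the i-th one."""
    n = len(xs)
    return int(all(any(x for j, x in enumerate(xs) if j != i) for i in range(n)))

def wa_operation(*xs):
    """OR over i of the AND of all arguments except the i-th one."""
    n = len(xs)
    return int(any(all(x for j, x in enumerate(xs) if j != i) for i in range(n)))

def xor3(x, y, z):
    return x ^ y ^ z

def is_near_unanimity(op, arity, domain=(0, 1)):
    """op is near-unanimity if it returns a whenever all arguments,
    with at most one exception, are equal to a."""
    for a, b in product(domain, repeat=2):
        for pos in range(arity):
            args = [a] * arity
            args[pos] = b
            if op(*args) != a:
                return False
    return True

print(is_near_unanimity(wm_operation, 4))
print(is_near_unanimity(wa_operation, 4))
print(is_near_unanimity(xor3, 3))
```

The affine operation x ⊕ y ⊕ z fails the test, as expected, since it returns the deviating argument rather than the repeated one.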

The remainder of the clones are depicted as white nodes in Figure 1. Let us now analyze them. Consider clone $P_1$, which is generated by the conjunction operation. It is known that every Horn formula is closed under conjunction, and in particular so is $[\bar{x} \lor \bar{y} \lor z]$. On the other hand, the clone generated by the disjunction operation preserves all anti-Horn formulas, in particular $[x \lor y \lor \bar{z}]$. Therefore, every clone generated by a finite set of relations and not containing any near-unanimity or any affine operation preserves either $[\bar{x} \lor \bar{y} \lor z]$ or $[x \lor y \lor \bar{z}]$ and, as is shown in [7], the associated set of quantified formulas is not polynomially predictable with membership queries under cryptographic assumptions. This simple reasoning gives us an alternative proof that replaces the involved case analysis in [7].

A direct way to extend the boolean case in [7] is to classify the grey nodes in Figure 1 in order to deal with bases containing an infinite number of relations. Obviously, we have to define some uniformity to deal with this case. We choose the size of the minimum CNF expression, since CNF formulas have conjunctive form (and are therefore convenient for our framework, which includes closure under conjunction) and have the maximum expressive power: if we choose any other natural representation class strictly containing CNF, we only obtain trivial non-learnability results.

Definition 6. Let $F$ be any formula in $\exists\forall$-Form$_c(S)$ where $S$ can contain an infinite number of boolean relations. We define the size of $F$, $|F|$, as the length of the formula obtained by replacing in $F$ every occurrence of a relation $R(x_1, \ldots, x_m)$ by the minimum equivalent CNF.

Theorem 9. (Generalized Dichotomy Theorem for quantified boolean formulas) For every set of boolean relations $S$,

1. If $S$ is bijunctive, linear, or k-weakly monotone or k-weakly antimonotone for some $k \ge 3$, then $\exists\forall$-Form$_c(S)$ is polynomially learnable with (proper) equivalence queries.

2. Else if $S$ is linear, then $\exists\forall$-Form$_c(S)$ is polynomially learnable with (improper) equivalence queries.

3. Else if $S$ is monotone or antimonotone, then $\exists\forall$-Form$_c(S)$ is polynomially learnable with (improper) equivalence queries and membership queries.

4. Else if $S$ is weakly monotone or weakly antimonotone, then DNF is prediction with membership-reducible to $\exists\forall$-Form$_c(S)$.

5. Else $\exists\forall$-Rel$_c(S)$ is not polynomially predictable with membership queries under the assumption that public key encryption systems secure against chosen ciphertext attack exist.

It is interesting to point out that, whereas for the finite basis case membership queries are no help, i.e., for every finite basis the corresponding set of quantified formulas is either polynomially learnable with equivalence queries alone or not predictable even with membership queries, they turn out to be necessary to study the infinite case, since the class of monotone formulas is not known to be learnable with equivalence queries alone.

Abstract. We show that multiplicity automata (MAs) with size $n$ and input alphabet $\Sigma$ can efficiently be learned from $n(n+1)|\Sigma| + 2$ smallest counterexamples. This improves on an earlier result of Bergadano and Varricchio. A unique representation for MAs is introduced. Our algorithm learns this representation. We also show that any learning algorithm for MAs needs at least $\frac{1}{64} n^2 |\Sigma|$ smallest counterexamples. Thus our upper bound on the number of counterexamples cannot be improved substantially.



1 Introduction 

We study the problem of learning an unknown target concept $f$ that is an element of a known set $\mathcal{F}$. A learning algorithm gets information about $f$ only by posing queries to a teacher. If $\mathcal{F}$ is a set of functions from a set $X$ to a set $Y$, typical queries are equivalence queries (EQs) and membership queries (MQs). For a function $g : X \to Y$ (the hypothesis), the teacher answers an EQ($g$) query with “YES” if $g$ is equal to the target concept $f$, and returns a counterexample $x \in X$, i.e., an element $x \in X$ with $f(x) \neq g(x)$, otherwise. For $x \in X$ the teacher answers an MQ($x$) query with the value $f(x)$.

If there is an ordering on $X$, we can also investigate equivalence queries that return smallest counterexamples (EQSC queries). When the learning algorithm poses an EQSC query for $g : X \to Y$ and $g$ is not equal to the target concept $f$, the teacher answers with the smallest $x \in X$ for which $f(x) \neq g(x)$, and with the value $f(x) \in Y$.
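
For experiments, a teacher answering EQSC queries over words can be simulated by brute force. The sketch below is not part of the paper's algorithm: the name `eqsc`, the encoding of words as Python strings, and the cut-off `max_len` are assumptions. Words are enumerated by increasing length and, within one length, in the order induced by the given alphabet, so the first disagreement found is the smallest one in that order. The enumeration is exponential in `max_len` and is meant for small test cases only.

```python
from itertools import product

def eqsc(hypothesis, target, alphabet, max_len):
    """Simulated EQSC teacher: return (c, target(c)) for the smallest word c
    of length at most max_len on which the two functions differ, or None
    if they agree on all such words."""
    for length in range(max_len + 1):
        # product() yields tuples in lexicographic order of the alphabet.
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            if hypothesis(word) != target(word):
                return word, target(word)
    return None

def count_a(word):
    return word.count("a")

def starts_with_a(word):
    return int(word.startswith("a"))

print(eqsc(lambda w: 0, count_a, "ab", 3))
print(eqsc(starts_with_a, count_a, "ab", 3))
print(eqsc(count_a, count_a, "ab", 3))
```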

The running time of an efficient learning algorithm must be polynomial in pa- 
rameters describing the complexity of the target concept and in the length of 
the longest counterexample ever returned by the teacher. 

Two interesting classes of functions for $\mathcal{F}$ are the functions accepted by DFAs and by MAs. Learnability of DFAs has been studied intensively. DFAs cannot be learned efficiently from MQs alone, or from EQs alone (see [?]). But Angluin has shown in [?] that DFAs can efficiently be learned with MQs and EQs. This algorithm was improved by Rivest and Schapire in [?]. Ibarra and Jiang have shown in [?] that, given the canonical ordering on $\Sigma^*$, DFAs of size $n$ can be learned efficiently with at most $|\Sigma| n^3$ EQSC queries. Birkendorf, Böker and Simon proved in [?] that even $|\Sigma| n^2$ counterexamples are sufficient. In [?], Bergadano and Varricchio have shown that MAs of size $n$ can be learned with ©(n^lAI) EQSC queries. Our main result in Sect. 4 is that $n(n+1)|\Sigma| + 2$ EQSC queries suffice. In addition, we show in Sect. 5 that this result cannot be improved substantially. Our learning algorithm has similarities to the algorithm for learning DFAs from smallest counterexamples of Birkendorf, Böker and Simon. E.g., the set of minimum representants of a DFA used in [?] corresponds to the set $R(f)$ that we define in Sect. 3 for a recognizable function $f$. We also use elements of the algorithm of Bergadano and Varricchio, namely prefixtree representations and the solving of systems of linear equations. Their algorithm is competitive with our algorithm (with respect to the number of queries and running time) in the special case where the target MA maps only words of a fixed length to non-zero values. However, they perform a reduction of the general case to this special case, which makes their algorithm less efficient.



2 Definitions and Notations 

The set of natural numbers 1, 2, 3, ... is denoted by $\mathbb{N}$. Let $\mathbb{N}_0 := \mathbb{N} \cup \{0\}$. An input alphabet is a finite set. In this article $\Sigma$ is always an ordered input alphabet. $\Sigma^*$ is the set of all words $x = x_1 \cdots x_n$ of a finite length $|x| = n$ with $x_1, \ldots, x_n \in \Sigma$. $\varepsilon$ is the word of length zero. The words $x_1 \cdots x_n$, $x_1 \cdots x_{n-1}$, ..., $x_1$, $\varepsilon$ are the prefixes of $x$, the words $x_1 \cdots x_n$, $x_2 \cdots x_n$, ..., $x_n$, $\varepsilon$ are the suffixes of $x$. A set $M \subseteq \Sigma^*$ is called prefix-closed if all prefixes of every element of $M$ also belong to $M$. For a non-empty, prefix-closed set $R \subseteq \Sigma^*$ and $c \in \Sigma^* \setminus R$ there are unique $r \in R$, $a \in \Sigma$, $z \in \Sigma^*$ so that $c = raz$ and $ra \notin R$. $r$ is the longest prefix of $c$ that is an element of $R$. We call $(r, a, z)$ the $R$-decomposition of $c$. For two words $x = x_1 \cdots x_n$, $y = y_1 \cdots y_m \in \Sigma^*$, we say that $x$ is canonically smaller than $y$, denoted by $x < y$, if $n < m$ or if $n = m$ and there is a number $i \in \{1, \ldots, n\}$ such that $x_1 \cdots x_{i-1} = y_1 \cdots y_{i-1}$ and $x_i < y_i$.
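
The canonical ordering and the R-decomposition can be written down directly. In this sketch words are Python strings, the ordered alphabet is a string whose character positions define the order on letters, and the function names are hypothetical.

```python
def canonically_smaller(x, y, alphabet):
    """x < y in the canonical order: shorter words first, words of equal
    length compared letter by letter using the order of `alphabet`."""
    if len(x) != len(y):
        return len(x) < len(y)
    key_x = [alphabet.index(ch) for ch in x]
    key_y = [alphabet.index(ch) for ch in y]
    return key_x < key_y

def r_decomposition(c, R):
    """Return (r, a, z) with c = r + a + z, r in R and r + a not in R,
    for a non-empty prefix-closed set R and a word c outside R."""
    if c in R:
        raise ValueError("c must not belong to R")
    # Walk from the longest proper prefix down to the empty word.
    for i in range(len(c) - 1, -1, -1):
        if c[:i] in R:
            return c[:i], c[i], c[i + 1:]
    raise ValueError("R must contain the empty word")

R = {"", "a", "ab"}
print(r_decomposition("abba", R))
print(canonically_smaller("b", "aa", "ab"))
print(canonically_smaller("ba", "ab", "ab"))
```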

Throughout the article $K$ is a field. For $n \in \mathbb{N}$, $K^n$ denotes the vector space of $n$-dimensional column vectors $x$. $x^T$, the transposed vector of $x$, is a row vector. For a set $M \subseteq K^n$, $\mathrm{span}(M)$ is the linear hull of $M$. $e_1, \ldots, e_n$ are the unit vectors, in particular $e_1 = (1, 0, \ldots, 0)^T$. For $m, n \in \mathbb{N}$, $K^{m \times n}$ is the set of $m \times n$ matrices with entries from $K$. We also look at infinite matrices. For sets $V, W$, $K^{V \times W}$ is the set of matrices with rows labeled by the elements of $V$ and columns labeled by the elements of $W$. The rank $\mathrm{rk}(A)$ of a matrix $A \in K^{V \times W}$ is the supremum of the ranks of all finite submatrices.

For functions $f, g : \Sigma^* \to K$, $f \neq g$, we define $\mathrm{mincex}(f, g) \in \Sigma^*$ to be the (canonically) smallest word $x \in \Sigma^*$ such that $f(x) \neq g(x)$. For an expression $A$, we write

$$[A] = \begin{cases} 1, & \text{if } A \text{ is true}, \\ 0, & \text{otherwise.} \end{cases}$$

The function $\chi_L : \Sigma^* \to K$ with $\chi_L(x) = [x \in L]$, $x \in \Sigma^*$, is the characteristic function of the set $L \subseteq \Sigma^*$.

For $n \in \mathbb{N}$, $\mu : \Sigma^* \to K^{n \times n}$ is a morphism of monoids if $\mu(x_1 \cdots x_m) = \mu(x_1) \cdots \mu(x_m)$ for all $m \in \mathbb{N}_0$, $x_1, \ldots, x_m \in \Sigma$. If $\mu : \Sigma^* \to K^{n \times n}$ is a morphism of monoids and $\gamma \in K^n$ is a vector, $(\mu, \gamma)$ is the multiplicity automaton (MA) of size $n$ for the function $f : \Sigma^* \to K$ with $f(x) = e_1^T \mu(x) \gamma$, $x \in \Sigma^*$. A function $f : \Sigma^* \to K$ is called recognizable if there is an MA for $f$.

For a function $f : \Sigma^* \to K$, $H_f = (f(xy))_{x, y \in \Sigma^*}$ is the Hankel matrix of $f$. $Z_f^x = (f(xy))_{y \in \Sigma^*}$ is the row of $H_f$ for prefix $x \in \Sigma^*$. It is well known that a set $L \subseteq \Sigma^*$ can be represented by a DFA if and only if the number of distinct rows of the Hankel matrix of $\chi_L$ is finite. In [?] (see also [?]) it is shown that a function $f : \Sigma^* \to K$ is recognizable if and only if the Hankel matrix $H_f$ has finite rank. Then $\mathrm{rk}(H_f)$ is the size of a minimal MA for $f$.
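
As a small worked example of these definitions, the sketch below evaluates an MA via $f(x) = e_1^T \mu(x) \gamma$ and computes the rank of a finite block of the Hankel matrix with exact rational arithmetic. The example automaton (counting the letter a), the helper names and the choice of the field (the rationals) are assumptions of this illustration. The rank of a finite block is only a lower bound on $\mathrm{rk}(H_f)$, although here it already equals the size of the automaton.

```python
from fractions import Fraction
from itertools import product

def ma_value(mu, gamma, word):
    """Compute e_1^T * mu(x_1) * ... * mu(x_m) * gamma for word = x_1...x_m."""
    n = len(gamma)
    row = [Fraction(int(i == 0)) for i in range(n)]  # the row vector e_1^T
    for letter in word:
        m = mu[letter]
        row = [sum(row[i] * m[i][j] for i in range(n)) for j in range(n)]
    return sum(row[j] * gamma[j] for j in range(n))

def rank(matrix):
    """Rank of a finite matrix by Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in r] for r in matrix]
    rk = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                factor = rows[i][col] / rows[rk][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[rk])]
        rk += 1
    return rk

def hankel_block(f, alphabet, max_len):
    """Finite block (f(xy)) of the Hankel matrix for |x|, |y| <= max_len."""
    words = ["".join(w) for n in range(max_len + 1)
             for w in product(alphabet, repeat=n)]
    return [[f(x + y) for y in words] for x in words]

# An MA of size 2 for f(x) = number of occurrences of "a" in x.
mu = {"a": [[1, 1], [0, 1]], "b": [[1, 0], [0, 1]]}
gamma = [0, 1]
f = lambda w: ma_value(mu, gamma, w)
print(f("abaa"), rank(hankel_block(f, "ab", 2)))
```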

3 Prefixtree Representations 

In this section, we look at another representation for recognizable functions, namely prefixtree representations. They were introduced in [?]. In addition, we investigate prefixtree representations that respect the canonical ordering on $\Sigma^*$. They will be very useful when we present the algorithm of Sect. 4, which learns from canonically smallest counterexamples.

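Definition 1. Let $R \subseteq \Sigma^*$ be a finite, non-empty, prefix-closed set, let $\alpha : R \to K$ be a function, and let $\lambda^{ra,s} \in K$, $r, s \in R$, $a \in \Sigma$, be numbers such that $\lambda^{ra,s} = [ra = s]$ if $ra \in R$. $(R, \alpha, \lambda)$ is called a prefixtree representation (of size $|R|$) of the function $f : \Sigma^* \to K$ with

$$f(x_1 \cdots x_m) = \sum_{s_1, \ldots, s_m \in R} \lambda^{\varepsilon x_1, s_1} \lambda^{s_1 x_2, s_2} \cdots \lambda^{s_{m-1} x_m, s_m} \alpha(s_m)$$

for $m \in \mathbb{N}_0$, $x_1, \ldots, x_m \in \Sigma$.

If we think of the elements of $R$ as states of the prefixtree representation, we can consider the numbers $\lambda^{ra,s}$ as weights for the transition from state $r$ to state $s$ when the next input letter is $a$. If $ra$ is an element of $R$, the weight is one if $ra = s$ and zero otherwise.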
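
The reading of a prefixtree representation as a weighted automaton whose states are the elements of $R$ can be sketched as follows. The encoding is an assumption of this illustration: words are strings, `alpha` is a dict on `R`, and `lam` is a dict that stores the weights $\lambda^{ra,s}$ only for $ra \notin R$ under the key `(ra, s)`, with missing entries read as zero. The example representation is worked out by hand for the function that counts the letter a.

```python
def prefixtree_value(R, alpha, lam, word):
    """Evaluate the function represented by (R, alpha, lam) on `word` by
    propagating weights from state '' (the empty word) letter by letter."""
    weights = {"": 1}
    for a in word:
        nxt = {}
        for r, w in weights.items():
            ra = r + a
            for s in R:
                # Inside R the transition is deterministic: weight [ra = s].
                coeff = int(ra == s) if ra in R else lam.get((ra, s), 0)
                if coeff:
                    nxt[s] = nxt.get(s, 0) + w * coeff
        weights = nxt
    return sum(w * alpha[s] for s, w in weights.items())

# Hand-made example for f(x) = number of "a" in x over the alphabet a < b:
# row(b) = row(''), row(aa) = 2*row(a) - row(''), row(ab) = row(a).
R = {"", "a"}
alpha = {"": 0, "a": 1}
lam = {("b", ""): 1, ("aa", "a"): 2, ("aa", ""): -1, ("ab", "a"): 1}
print(prefixtree_value(R, alpha, lam, "abaa"))
```

In this example every non-zero weight $\lambda^{ra,s}$ has $s$ canonically smaller than $ra$.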

Prefixtree representations are special MAs: Let $(R, \alpha, \lambda)$ be a prefixtree representation of $f$ and let $n = |R|$. We can write $R = \{r_1, \ldots, r_n\}$ with $r_1 = \varepsilon$. Let $\mu : \Sigma^* \to K^{n \times n}$ be the morphism of monoids with $\mu(a) = (\lambda^{r_i a, r_j})_{1 \le i, j \le n}$, $a \in \Sigma$, and let $\gamma := (\alpha(r_1), \ldots, \alpha(r_n))^T \in K^n$. Then $(\mu, \gamma)$ is an MA for $f$. Because of the requirement $\lambda^{ra,s} = [ra = s]$ if $ra \in R$, we have $f|_R = \alpha$ and

$$Z_f^{ra} = \sum_{s \in R} \lambda^{ra,s} Z_f^s \qquad (1)$$

for $r \in R$, $a \in \Sigma$. We say that $(R, \alpha, \lambda)$ is a canonical prefixtree representation if $\lambda^{ra,s} = 0$ for all $s > ra$. Then

$$Z_f^{ra} = \sum_{s \in R,\, s \le ra} \lambda^{ra,s} Z_f^s . \qquad (2)$$

We now show that for any recognizable function $f : \Sigma^* \to K$ there is a unique canonical prefixtree representation of size $\mathrm{rk}(H_f)$. This representation is learned by the algorithm in Sect. 4. Let

$$R(f) = \{ r \in \Sigma^* \mid Z_f^r \notin \mathrm{span}\{ Z_f^s \mid s \in \Sigma^*, s < r \} \} .$$

$R(f)$ is the set of prefixes $r \in \Sigma^*$ for which the row for $r$ of the Hankel matrix $H_f$ is not a linear combination of the rows for prefixes that are smaller than $r$ in the canonical ordering. $R(f)$ is prefix-closed. For all $r \in \Sigma^*$,

$$Z_f^r \in \mathrm{span}\{ Z_f^s \mid s \in R(f), s \le r \} . \qquad (3)$$

$(Z_f^s)_{s \in R(f)}$ is a basis of $\mathrm{span}\{ Z_f^r \mid r \in \Sigma^* \}$, thus $\mathrm{rk}(H_f) = |R(f)|$. Using (2) we can prove that $R(f) \subseteq R$ for any canonical prefixtree representation $(R, \alpha, \lambda)$ of $f$.

Proposition 1. Every non-zero recognizable function $f : \Sigma^* \to K$ has a unique canonical prefixtree representation $(R, \alpha, \lambda)$ of size $\mathrm{rk}(H_f)$.

Proof. Existence: Let $R = R(f)$, $\alpha = f|_R$. Since $(Z_f^s)_{s \in R}$ is a basis of $\mathrm{span}\{ Z_f^s \mid s \in \Sigma^* \}$, there are numbers $\lambda^{ra,s} \in K$, $r, s \in R$, $a \in \Sigma$, such that

$$Z_f^{ra} = \sum_{s \in R} \lambda^{ra,s} Z_f^s \qquad (4)$$

for $r \in R$, $a \in \Sigma$. Since these numbers are unique and because of (3), $(R, \alpha, \lambda)$ is a canonical prefixtree representation. Because of (4) and $\alpha = f|_R$, $(R, \alpha, \lambda)$ is a prefixtree representation of $f$. There holds $|R| = |R(f)| = \mathrm{rk}(H_f)$.
Uniqueness: If $(R, \alpha, \lambda)$ is a canonical prefixtree representation of $f$ with size $\mathrm{rk}(H_f)$, we have $R = R(f)$ (this follows from $R(f) \subseteq R$ and $|R(f)| = \mathrm{rk}(H_f) = |R|$). The function $\alpha$ must be equal to $f|_R$. Since $(Z_f^s)_{s \in R(f)}$ is linearly independent and because of (1), the numbers $\lambda^{ra,s}$ are unique.

4 The MA Learning Algorithm 

In this section we prove the following theorem. 

Theorem 1. There is a learning algorithm that learns the unique canonical pre- 
fixtree representation of size n = rk{H^) of any recognizable function f : E* 

K with at most n{n-\- 1)|A7| -|- 2 EQSC queries in time 0(n^|if|). 

Before we look at the algorithm, we describe the variables we are going to use. $k$ is an integer variable that counts the iterations of the main loop. For all $k$, $(R_k, \alpha_k, \lambda_k)$ is a prefixtree representation of the $k$-th hypothesis $f_k : \Sigma^* \to K$. $c_k = \mathrm{EQSC}(f_k) = \mathrm{mincex}(f_k, f)$ denotes the smallest counterexample for $f_k$; $(r_k, a_k, z_k)$ is the $R_k$-decomposition of $c_k$. For some words $r \in \Sigma^*$ and some $a \in \Sigma$, we store suffixes $z \in \Sigma^*$ of counterexamples $raz$ in lists $L_{ra}$.

In each iteration of the main loop, the algorithm tries to solve a system of linear equations. If the system has a solution, it is not necessary to enlarge the current hypothesis and only the numbers $\lambda_k^{r_k a_k, s}$, $s \in R_k$, are changed for the new hypothesis. Otherwise, the word $r_k a_k$ is added to the set $R_k$ in the procedure EXTEND.



Algorithm 1.

 1: INIT;
 2: $k := 1$;
 3: Repeat:
 4:   Let $(r_k, a_k, z_k)$ be the $R_k$-decomposition of $c_k$;
 5:   Append $z_k$ to the list $L_{r_k a_k}$;
 6:   While there exists a tuple of numbers $(\lambda_s)_{s \in R_k, s < r_k a_k}$ that solves all
      equations $f(r_k a_k z) = \sum_{s \in R_k, s < r_k a_k} \lambda_s f(sz)$ with $z$ element of $L_{r_k a_k}$:
 7:     $\lambda^{ra,s} := \lambda_k^{ra,s}$ if $ra \neq r_k a_k$;
        $\lambda^{ra,s} := \lambda_s$ if $ra = r_k a_k$, $s < r_k a_k$;
        $\lambda^{ra,s} := 0$ if $ra = r_k a_k$, $s > r_k a_k$;
 8:     $c := \mathrm{EQSC}(R_k, \alpha_k, \lambda)$;
 9:     If $c > c_k$:
10:       $(R_{k+1}, \alpha_{k+1}, \lambda_{k+1}) := (R_k, \alpha_k, \lambda)$; $c_{k+1} := c$;
11:       Go to Line 14;
12:     Append the $z \in \Sigma^*$ for which $c = r_k a_k z$ to the list $L_{r_k a_k}$;
13:   EXTEND;
14:   $k := k + 1$;

INIT:

I1: $c_0 := \mathrm{EQSC}(0)$;
I2: $R_1 := \{\varepsilon\}$;
I3: $\alpha_1(\varepsilon) := f(\varepsilon)$;
I4: $\lambda_1^{a,\varepsilon} := 0$ for $a \in \Sigma$;
I5: For $a \in \Sigma$ initialize $L_a$ as the empty list;
I6: $c_1 := \mathrm{EQSC}(R_1, \alpha_1, \lambda_1)$;

EXTEND:

E1: $R_{k+1} := R_k \cup \{r_k a_k\}$;
E2: $\alpha_{k+1}(r) := \alpha_k(r)$ if $r \in R_k$;
    $\alpha_{k+1}(r) := f(r_k a_k)$ if $r = r_k a_k$;
    for $r \in R_{k+1}$;
E3: $\lambda_{k+1}^{ra,s} := \lambda_k^{ra,s}$ if $r, s \in R_k$, $ra \neq r_k a_k$;
    $\lambda_{k+1}^{ra,s} := [ra = s]$ if $ra = r_k a_k$;
    $\lambda_{k+1}^{ra,s} := \sum_{t \in R_k} \lambda_k^{r_k a_k, t} \lambda_k^{ta, s}$ if $r = r_k a_k$, $s \in R_k$;
    $\lambda_{k+1}^{ra,s} := 0$ otherwise;
    for $r, s \in R_{k+1}$, $a \in \Sigma$;
E4: For $a \in \Sigma$ initialize $L_{r_k a_k a}$ as the empty list;
E5: $c_{k+1} := \mathrm{EQSC}(R_{k+1}, \alpha_{k+1}, \lambda_{k+1})$;

If an EQSC query returns “YES”, the algorithm stops immediately and outputs the correct prefixtree representation (or an MA computed from this prefixtree representation).

By induction on $k$, we see that all $(R_k, \alpha_k, \lambda_k)$ and all $(R_k, \alpha_k, \lambda)$ are canonical prefixtree representations and that $\alpha_k = f|_{R_k}$ for all $k$. (Note that if $\lambda_{k+1}^{ra,s} \neq 0$ for $r = r_k a_k$, $s \in R_k$ in Line E3, then there is a $t \in R_k$ such that $\lambda_k^{r_k a_k, t} \lambda_k^{ta, s} \neq 0$. Because of the induction hypothesis $t < r_k a_k = r$ and $s \le ta < ra$ follow.)

It is not evident how some of the lines of the algorithm can be executed. We make some remarks about those lines. In Line 4, $c_k$ has an $R_k$-decomposition since $f_k(c_k) \neq f(c_k)$ (thus $c_k \notin R_k$). In Line 6, the numbers $f(r_k a_k z)$, $f(sz)$ are known: For every $z$ from the list $L_{r_k a_k}$, $r_k a_k z$ is a counterexample returned by an EQSC query for some prefixtree representation $(R, \alpha, \lambda)$. The teacher has made $f(r_k a_k z)$ known and the values $f(sz)$ can be computed with $(R, \alpha, \lambda)$, since $sz$ is smaller than $r_k a_k z$. In Proposition 2 we prove that $r_k a_k$ is a prefix of $c$ in Line 12. In Line I3, we know the value $f(\varepsilon)$ because the teacher has made $f(c_0) = f(\varepsilon)$ known if $c_0 = \varepsilon$, and we know that $f(\varepsilon) = 0$ if $c_0 \neq \varepsilon$. In Line E2, either $f(r_k a_k)$ has been made known by the teacher (if $z_k = \varepsilon$) or $r_k a_k < r_k a_k z_k = c_k$, and $f(r_k a_k)$ can be computed with the prefixtree representation $(R_k, \alpha_k, \lambda_k)$.

Proposition 2. $r_k a_k$ is a prefix of $c$ whenever Line 12 is executed.

Proof. We know $c \le c_k$. If $c = c_k$, the assertion is evident. (However, the proof of Proposition 5 will show that this case never occurs.) Otherwise, let $h$ be the function with prefixtree representation $(R_k, \alpha_k, \lambda)$. Since $\mathrm{mincex}(h, f) = c < c_k = \mathrm{mincex}(f_k, f)$, we have $c = \mathrm{mincex}(f_k, h)$. Since $c \notin R_k$, $c$ has an $R_k$-decomposition $(r, a, z)$. Assume that $ra \neq r_k a_k$. Then $\lambda^{ra,s} = \lambda_k^{ra,s}$ for $s \in R_k$, and (2) yields

$$\sum_{s \in R_k, s < ra} \lambda_k^{ra,s} f_k(sz) = f_k(raz) \neq h(raz) = \sum_{s \in R_k, s < ra} \lambda^{ra,s} h(sz) = \sum_{s \in R_k, s < ra} \lambda_k^{ra,s} h(sz) .$$

Thus there is an $s \in R_k$, $s < ra$, such that $f_k(sz) \neq h(sz)$. Since $sz < raz = c = \mathrm{mincex}(f_k, h)$, this is a contradiction.

The next proposition is a crucial observation. It explains the choice of the $\lambda_{k+1}^{ra,s}$ in Line E3.

Proposition 3. The sequence $c_k$, $k \ge 1$, is non-decreasing.

Proof. Let $k \ge 1$. If $c_{k+1}$ is computed in Line 10, we have $c_{k+1} > c_k$. If $c_{k+1}$ is computed in Line E5, we have $R_{k+1} = R_k \cup \{r_k a_k\}$ and we look at the two cases $z_k = \varepsilon$ and $z_k \neq \varepsilon$.

Case 1: $z_k = \varepsilon$. Since $r_k a_k = c_k = \mathrm{mincex}(f_k, f)$ and $c_{k+1} = \mathrm{mincex}(f_{k+1}, f)$, we have to show that $f_k(x) = f_{k+1}(x)$ for all $x \in \Sigma^*$, $x < r_k a_k$. We show this by induction on $x$, looking at the elements of $\Sigma^*$ in the canonical order. If $x \in R_k$, then $f_k(x) = f(x) = f_{k+1}(x)$. If $x \notin R_k$, let $(r, a, z)$ be the $R_k$-decomposition of $x$. Since $ra \le raz = x < r_k a_k$, we know that $ra \neq r_k a_k$. Thus $(r, a, z)$ is also the $R_{k+1}$-decomposition of $x$ and we have

$$f_k(x) = f_k(raz) = \sum_{s \in R_k, s < ra} \lambda_k^{ra,s} f_k(sz) = \sum_{s \in R_{k+1}, s < ra} \lambda_{k+1}^{ra,s} f_{k+1}(sz) = f_{k+1}(raz) = f_{k+1}(x) .$$

Case 2: $z_k \neq \varepsilon$. We show $f_{k+1} = f_k$. This would follow from

$$\forall i \in \mathbb{N}_0, x \in R_{k+1}, y \in \Sigma^i : f_{k+1}(xy) = f_k(xy) .$$

We show the latter assertion by induction on $i$. For $i = 0$, the assertion follows from $f_{k+1}(x) = f(x) = f_k(x)$ for $x \in R_{k+1}$. (Note that $f_k(r_k a_k) = f(r_k a_k)$ since $r_k a_k < r_k a_k z_k = c_k = \mathrm{mincex}(f_k, f)$.) Let $i > 0$ and assume that the assertion is true for $0, \ldots, i-1$. Let $x \in R_{k+1}$, $y \in \Sigma^i$. The induction base implies that $f_{k+1}(xy) = f_k(xy)$ if $xy \in R_{k+1}$. Otherwise, let $(r, a, z)$ be the $R_{k+1}$-decomposition of $xy$. Using $|z| < |y|$ and the induction hypothesis, we get $f_{k+1}(sz) = f_k(sz)$ for $s \in R_{k+1}$. Since $ra \notin R_{k+1}$, we know $ra \neq r_k a_k$. Thus if $r \in R_k$, then

$$f_{k+1}(xy) = f_{k+1}(raz) = \sum_{s \in R_{k+1}} \lambda_{k+1}^{ra,s} f_{k+1}(sz) = \sum_{s \in R_k} \lambda_k^{ra,s} f_k(sz) = f_k(raz) = f_k(xy) .$$

If $r \notin R_k$, i.e., $r = r_k a_k$, then

$$f_{k+1}(xy) = f_{k+1}(raz) = \sum_{s \in R_{k+1}} \lambda_{k+1}^{ra,s} f_{k+1}(sz) = \sum_{s \in R_k} \Bigl( \sum_{t \in R_k} \lambda_k^{r_k a_k, t} \lambda_k^{ta,s} \Bigr) f_k(sz)$$

$$= \sum_{t \in R_k} \lambda_k^{r_k a_k, t} \Bigl( \sum_{s \in R_k} \lambda_k^{ta,s} f_k(sz) \Bigr) = \sum_{t \in R_k} \lambda_k^{r_k a_k, t} f_k(taz) = f_k(r_k a_k a z) = f_k(raz) = f_k(xy) .$$

Since all prefixtree representations $(R_k, \alpha_k, \lambda_k)$ and $(R_k, \alpha_k, \lambda)$ are canonical prefixtree representations, we know that $R(f) \subseteq R_k$ if an $\mathrm{EQSC}(R_k, \alpha_k, \lambda_k)$ query or an $\mathrm{EQSC}(R_k, \alpha_k, \lambda)$ query respectively returns “YES”. The following proposition shows that $R_k = R(f)$ in this case, i.e., $(R_k, \alpha_k, \lambda_k)$ or $(R_k, \alpha_k, \lambda)$ respectively is the unique prefixtree representation of $f$ from Proposition 1.

Proposition 4. $R_k \subseteq R(f)$ for all $k$.

Proof. In Line I2 we know $\varepsilon \in R(f)$ since $f \neq 0$. We have to show that $r_k a_k \in R(f)$ whenever the procedure EXTEND is invoked. Suppose that there is a $k$ such that $r_k a_k \notin R(f)$ and EXTEND is invoked. Then there is a finite set $S \subseteq \Sigma^*$ with $s < r_k a_k$ for $s \in S$ and there are numbers $\lambda_s \in K$ for $s \in S$ such that

$$Z_f^{r_k a_k} = \sum_{s \in S} \lambda_s Z_f^s .$$

Since $(R_k, \alpha_k, \lambda_k)$ is a canonical prefixtree representation for $f_k$, there are numbers $\mu_{s,t} \in K$ for $s \in S$, $t \in R_k$, $t \le s$, such that

$$Z_{f_k}^s = \sum_{t \in R_k, t \le s} \mu_{s,t} Z_{f_k}^t , \quad s \in S .$$

The system in Line 6 had no solution. Thus there is an element $z$ in the list $L_{r_k a_k}$ such that

$$f(r_k a_k z) \neq \sum_{t \in R_k, t < r_k a_k} \Bigl( \sum_{s \in S, t \le s} \lambda_s \mu_{s,t} \Bigr) f(tz) . \qquad (5)$$

$z$ was added to $L_{r_k a_k}$ because of the counterexample $r_k a_k z$. From Proposition 3 we know $r_k a_k z \le c_k$. For $s \in S$, $t \in R_k$, $t \le s$, we have $tz \le sz < r_k a_k z \le c_k = \mathrm{mincex}(f_k, f)$, i.e., $f_k(tz) = f(tz)$, $f_k(sz) = f(sz)$. Hence,

$$f(r_k a_k z) = \sum_{s \in S} \lambda_s f(sz) = \sum_{s \in S} \lambda_s f_k(sz) = \sum_{s \in S, t \in R_k, t \le s} \lambda_s \mu_{s,t} f_k(tz)$$

$$= \sum_{s \in S, t \in R_k, t \le s} \lambda_s \mu_{s,t} f(tz) = \sum_{t \in R_k, t < r_k a_k} f(tz) \Bigl( \sum_{s \in S, t \le s} \lambda_s \mu_{s,t} \Bigr) ,$$

which contradicts (5).

By now we know that suffixes are added to at most $n|\Sigma|$ lists $L_{ra}$ with $r \in R(f)$, $a \in \Sigma$. The next proposition gives an upper bound on the number of elements in any of the lists.

Proposition 5. For every $r \in \Sigma^*$, $a \in \Sigma$, the list $L_{ra}$ contains at most $n + 1$ elements.

Proof. We can show inductively that

$$f(raz) = \sum_{s \in R_k, s < ra} \lambda^{ra,s} f(sz)$$

for $r \in R_k$, $a \in \Sigma$, $ra \notin R_k$ and $z$ from the list $L_{ra}$ whenever an EQSC query for some prefixtree representation $(R_k, \alpha_k, \lambda)$ is posed. The easy proof is omitted here.

Let $r \in \Sigma^*$, $a \in \Sigma$. Assume that the elements $z_1, \ldots, z_l$ were added to $L_{ra}$ in this order. For $1 \le i \le l$, the word $raz_i$ was returned by an $\mathrm{EQSC}(R, \alpha, \lambda)$ query, $(R, \alpha, \lambda)$ is a canonical prefixtree representation, $R \subseteq R(f)$, $(r, a, z_i)$ is the $R$-decomposition of $raz_i$, and we have

$$f(raz) = \sum_{s \in R, s < ra} f(sz) \lambda^{ra,s} , \quad z \in \{z_1, \ldots, z_{i-1}\} .$$

Let $g : \Sigma^* \to K$ be the function with prefixtree representation $(R, \alpha, \lambda)$. Since

$$f(raz_i) \neq g(raz_i) = \sum_{s \in R, s < ra} g(sz_i) \lambda^{ra,s} = \sum_{s \in R, s < ra} f(sz_i) \lambda^{ra,s} ,$$

in the system

$$f(raz) = \sum_{s \in R(f), s < ra} f(sz) \lambda_s , \quad z \in \{z_1, \ldots, z_l\} ,$$

the equation for $z_i$ is independent of the equations for $z_1, \ldots, z_{i-1}$. Since there are at most $|R(f)| = n$ variables, there cannot be more than $n + 1$ independent equations. Thus $l \le n + 1$.



Proposition 6. The learning algorithm poses at most $n(n+1)|\Sigma| + 2$ EQSC queries.

Proof. For every EQSC query, except the first and the last one, an element is added to some list $L_{ra}$ with $r \in R(f)$, $a \in \Sigma$. Together with Proposition 5 this yields the assertion.
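
To get a feeling for the numbers, the upper bound of Proposition 6 can be tabulated next to the size of the set $S = \{a,b\}^{\lfloor \log_2 n \rfloor - 2} \cdot \Sigma \cdot \{a,b\}^{\lfloor \log_2 n \rfloor - 2}$ that is used for the lower bound in Sect. 5. This is only an arithmetic sketch; the function names are hypothetical, and the comparison with $n^2|\Sigma|/64$ relies on the elementary estimate $2^{\lfloor \log_2 n \rfloor} > n/2$.

```python
def upper_bound(n, sigma):
    """Number of EQSC queries that suffice: n(n+1)|Sigma| + 2."""
    return n * (n + 1) * sigma + 2

def shattered_set_size(n, sigma):
    """Size of {a,b}^m . Sigma . {a,b}^m with m = floor(log2 n) - 2."""
    m = n.bit_length() - 1 - 2  # bit_length() - 1 equals floor(log2 n)
    if m < 0:
        raise ValueError("n must be at least 4")
    return (2 ** m) * sigma * (2 ** m)

for n in (4, 16, 100):
    size = shattered_set_size(n, 2)
    # size > n^2 * |Sigma| / 64, checked in integer arithmetic
    print(n, size, 64 * size > n * n * 2, upper_bound(n, 2))
```

Both quantities grow like $n^2|\Sigma|$, so the two bounds differ only by a constant factor.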

Using the last two propositions and the following lemma, we can show that the 
running time of Algorithm E is 0(n®|Af|). 



Lemma 1. For recognizable functions $f, g : \Sigma^* \to K$, $f \neq g$, there is a word $c \in \Sigma^*$ such that $f(c) \neq g(c)$ and $|c| \le \mathrm{rk}(H_f) + \mathrm{rk}(H_g) - 1$.

Proof. Let $c = c_1 \cdots c_m \in \Sigma^*$ be a shortest word such that $f(c) \neq g(c)$. Let $x_i = c_1 \cdots c_i$, $y_i = c_{i+1} \cdots c_m$ for $0 \le i \le m$, $d = f(c) - g(c) \neq 0$. Then

$$\bigl( (f-g)(x_i y_j) \bigr)_{0 \le i,j \le m} = \begin{pmatrix} d & 0 & \cdots & 0 \\ * & d & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ * & \cdots & * & d \end{pmatrix}$$

is a regular submatrix of $H_{f-g}$. Thus $m + 1 \le \mathrm{rk}(H_{f-g}) \le \mathrm{rk}(H_f) + \mathrm{rk}(H_g)$.

5 A Lower Bound on the Number of EQSC Queries

Using a method from [?] we show that the upper bound on the number of EQSC queries from Sect. 4 cannot be improved substantially. Let

$$\mathcal{H} := \{ f : \Sigma^* \to K \mid f \text{ is recognizable} \} .$$

For $\mathcal{F} \subseteq \mathcal{H}$ let $\mathrm{SC}(\mathcal{F})$ be the smallest number $m \in \mathbb{N}_0 \cup \{\infty\}$ such that $m$ EQSC queries are needed in the worst case to learn any function in $\mathcal{F}$. If $1 \le \mathrm{SC}(\mathcal{F}) < \infty$, we have

$$\mathrm{SC}(\mathcal{F}) = 1 + \min_{h \in \mathcal{H}} \max_{f \in \mathcal{F}} \mathrm{SC}(\{ g \in \mathcal{F} \mid \mathrm{mincex}(g, f) > \mathrm{mincex}(h, f) \}) . \qquad (6)$$

Because of $\mathrm{SC}(\mathcal{F}) \ge 1$, the target concept is unknown. So the learner has to pose an $\mathrm{EQSC}(h)$ query (this leads to the summand 1) for some $h \in \mathcal{H}$ (so we have to take the minimum). Every $f \in \mathcal{F}$ can be the target concept (so we take the maximum). If $h \in \mathcal{H}$ is the hypothesis and $f \in \mathcal{F}$ is the target concept, the learning algorithm cannot know which of the concepts that are still conceivable after receiving the counterexample $\mathrm{mincex}(h, f)$ is the target concept. The “version space” of still conceivable concepts consists of all functions $g$ satisfying $\mathrm{mincex}(g, f) > \mathrm{mincex}(h, f)$.

With this equation and the SC-dimension, which we introduce now, we will prove the lower bound. The SC-dimension is comparable to the VC-dimension, see [?]. We call a finite set $S \subseteq \Sigma^*$ SC-shattered for $\mathcal{F}$ if there is a subset $\mathcal{G} \subseteq \mathcal{F}$ such that

$$\{ g|_S \mid g \in \mathcal{G} \} = \{ g : S \to \{0, 1\} \}$$

and such that the elements of $\mathcal{G}$ all yield the same value on all $x \in \Sigma^* \setminus S$, $x < \max(S)$. The SC-dimension of $\mathcal{F}$ is

$$\mathrm{SCdim}(\mathcal{F}) := \sup \{ |S| \mid S \subseteq \Sigma^* \text{ is SC-shattered for } \mathcal{F} \} .$$

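Proposition 7. For every set $\mathcal{F} \subseteq \mathcal{H}$, $\mathrm{SC}(\mathcal{F}) \ge \mathrm{SCdim}(\mathcal{F})$.

Proof. We show inductively:

\[ \forall d \in \mathbb{N}_0,\ \mathcal{F} \subseteq \mathcal{H} : \ \mathrm{SCdim}(\mathcal{F}) \ge d \Rightarrow \mathrm{SC}(\mathcal{F}) \ge d. \tag{7} \]

This is evident for $d = 0$. Let $d > 0$ and assume that (7) is true for $d - 1$. There
is an SC-shattered set $S = \{s_1, \ldots, s_d\}$ for $\mathcal{F}$ with $s_1 < s_2 < \cdots < s_d$. Let $\mathcal{G}$ be
a subset of $\mathcal{F}$ such that

\[ \{ g|_S \mid g \in \mathcal{G} \} = \{ g : S \to \{0, 1\} \} \]

and such that the elements of $\mathcal{G}$ all yield the same value on all words $x \in \Sigma^* \setminus S$,
$x < s_d$. For every $h \in \mathcal{H}$, there is a function $f \in \mathcal{G}$ such that $f(s_1) \neq h(s_1)$.
$S' = \{s_2, \ldots, s_d\}$ is SC-shattered for

\[ \mathcal{F}' = \{ g \in \mathcal{F} \mid \mathrm{mincex}(h, f) < \mathrm{mincex}(g, f) \}. \]

(This can be shown using the set $\mathcal{G}' = \{ g \in \mathcal{G} \mid g(s_1) = f(s_1) \}$; because of
$\mathrm{mincex}(h, f) \le s_1 < s_2 \le \mathrm{mincex}(g, f)$ for $g \in \mathcal{G}' \subseteq \mathcal{G}$, we have $\mathcal{G}' \subseteq \mathcal{F}'$.) Thus
$\mathrm{SCdim}(\mathcal{F}') \ge d - 1$. Because of (7), $\mathrm{SC}(\mathcal{F}') \ge d - 1$. Because of (6),
$\mathrm{SC}(\mathcal{F}) \ge 1 + (d - 1) = d$.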



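Proposition 8. Let $\mathcal{F} = \{ f : \Sigma^* \to K \mid \mathrm{rk}(H_f) \le n \}$. If $n \ge 4$ and $|\Sigma| \ge 2$
then $\mathrm{SC}(\mathcal{F}) \ge \mathrm{SCdim}(\mathcal{F}) > \frac{1}{64} n^2 |\Sigma|$.

Proof. There are $a, b \in \Sigma$ such that $a \neq b$. We show that

\[ S = \{a, b\}^{\lfloor \log_2 n \rfloor - 2} \cdot \Sigma \cdot \{a, b\}^{\lfloor \log_2 n \rfloor - 2} \]

is SC-shattered for $\mathcal{F}$. Since

\[ |S| = 2^{\lfloor \log_2 n \rfloor - 2} \cdot |\Sigma| \cdot 2^{\lfloor \log_2 n \rfloor - 2} > \frac{1}{64} n^2 |\Sigma|, \]

this proves the assertion. Let $\mathcal{G} = \{ f : \Sigma^* \to \{0, 1\} \mid f(x) = 0 \text{ for } x \in \Sigma^* \setminus S \}$.
We show $\mathcal{G} \subseteq \mathcal{F}$. A function $f \in \mathcal{G}$ is non-zero only for words of length
$2 \lfloor \log_2 n \rfloor - 3$. Thus

\[
\begin{aligned}
\mathrm{rk}(H_f) &= \sum_{k=0}^{\lfloor \log_2 n \rfloor - 2} \Bigl( \mathrm{rk}\bigl((f(xy))_{x \in \{a,b\}^k,\, y \in \Sigma^*}\bigr) + \mathrm{rk}\bigl((f(xy))_{x \in \Sigma^*,\, y \in \{a,b\}^k}\bigr) \Bigr) \\
&\le \sum_{k=0}^{\lfloor \log_2 n \rfloor - 2} 2 \cdot 2^k < 2 \cdot 2^{\lfloor \log_2 n \rfloor - 1} \le n,
\end{aligned}
\]

where $\{a, b\}^k = \{ x_1 \cdots x_k \mid x_1, \ldots, x_k \in \{a, b\} \}$.

This shows that in order to learn the class of recognizable functions whose Hankel
matrices have rank at most $n$, more than $\frac{1}{64} n^2 |\Sigma|$ EQSC queries are needed.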

Abstract. We prove the following results. Any Boolean function of
O(log n) relevant variables can be exactly learned with a set of non-adaptive
membership queries alone and a minimum sized decision tree
representation of the function constructed, in polynomial time. In contrast,
such a function cannot be exactly learned with equivalence queries
alone using general decision trees and other representation classes as
hypotheses.

Our results imply others which may be of independent interest. We show
that truth-table minimization of decision trees can be done in polynomial
time, complementing the well-known result of Masek that truth-table
minimization of DNF formulas is NP-hard. The proofs of our negative results
show that general decision trees and related representations are not
learnable in polynomial time using equivalence queries alone, confirming
a folklore theorem.



1 Introduction 

Exact learning using queries is a well-studied model in computational learning. 
Of the many kinds of queries considered by Angluin m, membership and equiv- 
alence queries have established themselves as the standard combination to be 
used for exact learning. 

Some results that characterize learnability with a polynomial number of polynomially
sized queries are known. For example, Angluin [2] showed that the presence
of the approximate fingerprint property for a concept class is a sufficient
condition for non-learnability with equivalence queries alone. Using this char- 
acterization, Angluin showed that equivalence queries do not suffice for learn- 
ing deterministic or nondeterministic finite automata, context-free grammars, or 
general CNF or DNF formulas. Later on, Gavalda El proved this property to be 
not only sufficient but also a necessary condition. On the other hand, Goldman
and Kearns m showed that if a concept class is learnable with a polynomial 
number of membership queries alone, then it has polynomial teaching dimension. 
Moreover, the converse holds provided the concept class is projection closed, as 
shown by Hellerstein et al. m- Finally, it is known m that a concept class is 
learnable with a polynomial number of equivalence and membership queries if 
and only if that class has polynomial certificates. 

Decision trees are a popular representation of Boolean functions, which form 
the basic inference engine used in many machine learning programs. The learn- 
ability of decision trees has been well-studied in the learning theory community. 
For example, it is known d that μ-decision trees are exactly learnable with
equivalence and membership queries, but neither type of query alone suffices for 
polynomial learning. Bshouty 0 has shown that general decision trees are learn- 
able with extended equivalence queries (the hypotheses used are depth-three 
formulas) and membership queries. It is not known whether general decision 
trees are learnable with proper equivalence queries and membership queries, nor 
is it known whether decision trees are properly PAC-learnable with or without 
membership queries. 

In this paper, we focus on concepts that depend on very few variables. Since 
Littlestone’s seminal work m on this topic, such concept classes have been 
well studied in learning theory mm- Recently, Damaschke [8,9] studied exact
learning of Boolean functions when irrelevant attributes abound, primarily in
the model of learning with membership queries alone. He was able to show
that a set X of non-adaptive membership queries can be constructed in time
polynomial in n, the number of variables, so that X suffices to exactly learn every
Boolean function of O(log n) variables in polynomial time. Moreover, a truth-table
representing the learned Boolean function can be constructed in polynomial
time as well. Here, we offer a simple alternative proof of this result and show in 
addition that a decision tree of minimum size to represent the learned function 
can also be constructed in polynomial time. In contrast, we show that such 
functions cannot be learned with equivalence queries alone using as hypotheses 
decision trees or other popular representation classes. Our result implies a folk 
theorem that general decision trees cannot be properly learned with equivalence 
queries alone. 

Our positive result shows that truth-table minimization of decision trees can be 
done in polynomial time. This may be contrasted with the well-known result of 
Masek cited in Garey and Johnson’s book [12] that truth-table minimization of
DNF formulas is NP-hard. 

The rest of the paper is organized as follows. Section 2 has definitions used in the
rest of the paper. Section 3 has our positive results on exact learning with non-adaptive
membership queries when irrelevant variables abound and Section 4 has
the negative results for exact learning such concepts with equivalence queries.

2 Preliminaries

We use standard terminology in learning theory. See for more detailed descrip- 
tions of the learning model and a formalization of concept classes, representation 
classes, examples, etc. 

We consider learning algorithms that make equivalence queries or membership 
queries to a teacher. The teacher answers the equivalence queries according to a 
target concept $c \in C$. More precisely, an equivalence query is a representation
$r \in \mathcal{R}$ of some concept in $C$, and the answer is either YES, if the concept represented
by $r$ is equivalent to $c$, or a counterexample in the symmetric difference of $c$ and
the concept represented by $r$, otherwise. A membership query is an example and
the answer is either YES or NO, depending on how the target concept classifies
that example.

We are interested in Boolean concepts or, in other words, Boolean functions
defined over a finite set of Boolean variables $\{x_1, \ldots, x_n\}$. The classes of representations
we consider here are the following:

1. DT (Decision Trees). A decision tree $d$ is a binary tree where the leaves are
labeled either 0 or 1, and each internal node is labeled with a variable. Given
an assignment $a \in \{0, 1\}^n$, $d(a)$ is evaluated by starting at the root and
iteratively applying the following rule, until a leaf is reached: let the variable
at the current node be $x_i$; if the value of $a$ at position $i$ is 1 then branch
right; otherwise branch left. If the leaf reached is labeled 0 (resp. 1) then
$d(a)$ = FALSE (resp. TRUE). The size of a decision tree is its number of
nodes. The decision-tree size (DT-size) of a Boolean function $f$ is the size
of the smallest decision tree that can represent $f$.

2. CDNF. A CDNF formula is a pair $(f, g)$, where $f$ is a DNF formula and $g$
is a CNF formula and $f = g$. It is convenient to define the size of such a
formula as $\max\{n, m, q\}$, where $n$ is the number of variables over which the
formulas $f$ and $g$ are defined, $m$ is the number of terms in $f$, and $q$ is the
number of clauses in $g$. The CDNF-size of a Boolean function is the size of
the smallest CDNF formula that can represent it.

Equivalent but slightly different definitions of CDNF formulas and their sizes 
can be found in 0 and In 0, it is shown that CDNF formulas can be 
exactly learned in polynomial time with extended equivalence queries (of 
depth-3 formulas) and membership queries. In M, it is shown that CDNF 
formulas are polynomial-query learnable with proper equivalence queries and 
membership queries. Here, we show that this latter result is tight: we show 
that proper equivalence queries alone do not suffice for polynomial-query 
learnability. 

3. Self-Dual DNF (Self-Dual CNF). A DNF (CNF) formula $f$ is self-dual if
$f(a) \neq f(\bar{a})$ for all $a \in \{0, 1\}^n$, where $\bar{a}$ is the bitwise complement of $a$. We
direct the reader to 0 for work on the self-dual formulas. 

4. log n-DNF ∩ log n-CNF. This class contains the functions representable
both as DNF formulas whose terms have at most log n literals, and CNF
formulas whose clauses have at most log n literals. For our purposes, we
consider a representation in this class to be a pair $(f, g)$, where $f$ is a log n-DNF
formula and $g$ is a log n-CNF formula. The size of such a pair is the
same as its CDNF size.

There are several results in learning theory about this interesting class. No- 
tice that this class contains the class of log n-depth decision trees, shown to 
be exactly learnable in polynomial time m with membership queries. (The 
output of the learning algorithm is not a log n-depth decision tree but a 
representation based on Fourier coefficients.) In |^|, it was shown that log n-DNF
∩ log n-CNF is learnable with membership queries and an NP-oracle,
in a representation of depth-3 Boolean formulas. Finally, in ng it was shown
that this class is properly learnable with membership queries and an oracle
for NP ∩ co-NP. It is open whether this class is properly learnable with
membership queries alone in polynomial time. Here we show that the class 
is not polynomial-query learnable with equivalence queries alone. 

5. BP (Branching programs). A branching program is a directed, acyclic graph,
with a unique node of in-degree 0 (called the root), and two nodes of out-degree
0 (called leaves), one labeled 0, and the other labeled 1; each non-leaf
node of the graph contains a variable, and has out-degree exactly two.
Assignments are evaluated following the same rule as for decision trees.

A branching program is in the class log n-depth BP if its longest path has
length at most log n. The size of a branching program is the number of nodes
in it.

General branching programs are known to be not learnable with equivalence
and membership queries. (See m for a detailed study of subclasses of
branching programs according to learnability.) In [15], it is shown that the
subclass of log n-depth branching programs is properly learnable with membership
queries and an oracle for NP ∩ co-NP; it is open whether the oracle
can be dispensed with. Here we show that the class is not polynomial-query
learnable with equivalence queries alone.



3 Learning with Non-adaptive Membership Queries 

Let $\mathcal{F}_{n,r}$ denote the set of Boolean functions over $n$ variables which depend on at
most $r$ variables. The following is a restatement of a result of Damaschke:

Theorem 1. There exists a set of assignments, $X$, of size at most a polynomial
in $2^r$ and $\log n$ such that a truth-table for any function $f \in \mathcal{F}_{n,r}$ can be learned
solely by making membership queries on the set $X$. Moreover, the set $X$ and the
truth-table can be constructed in time polynomial in $2^r$ and $n$.

We independently arrived at a proof of the above theorem; however, Damaschke’s 
result has sharper bounds for the construction of X and the time to do the 
inference of the truth-table than our proof provides. For the sake of completeness, 
we outline our simple proof of the above theorem. 

Use the deterministic construction in m to obtain an $(n, r)$-universal set $B$ of size
$2^{O(r)} \log n$ in time polynomial in $2^r$ and $n$. Let $B'$ be the set of all assignments
at Hamming distance 1 from an assignment in $B$; thus $|B'| \le n \cdot |B|$. Let
$X = B \cup B'$. Now, to learn any function $f \in \mathcal{F}_{n,r}$, make membership queries on all the
assignments in $X$. Clearly, if we get a different answer for two assignments which
differ only in variable $v$, then $v$ must be in the set $R$ of relevant variables of $f$.
Conversely, the facts that $B$ is an $(n, r)$-universal set and that $|R| \le r$ guarantee
that if a variable $v$ is in $R$ then there are two assignments in $X$ which differ only
in variable $v$ for which we will get different answers to membership queries. Thus
we can find the set $R$ in the claimed time. Once $R$ is found, we can construct
a truth table of $2^{|R|}$ entries by searching in $X$ for every projection over the
variables in $R$ and logging the answer to the membership query.
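
The inference step of this outline can be sketched in Python. This is only an illustration under a strong simplifying assumption: instead of the small (n, r)-universal set from the cited construction, the hypothetical helper `build_query_set` takes B to be all of {0,1}^n, which is trivially universal but exponentially large. The names `build_query_set`, `learn_truth_table` and `member` are ours and do not come from the paper.

```python
from itertools import product


def build_query_set(n):
    """Stand-in for X = B ∪ B'.

    Assumption: B is all of {0,1}^n (trivially (n, r)-universal for every r).
    B' is the set of assignments at Hamming distance 1 from some element of B.
    """
    base = set(product((0, 1), repeat=n))
    neighbours = {a[:i] + (1 - a[i],) + a[i + 1:] for a in base for i in range(n)}
    return base | neighbours


def learn_truth_table(member, n):
    """Find the relevant variables and a truth table from a fixed query set."""
    queries = build_query_set(n)
    # Non-adaptive: every query is fixed before any answer is seen.
    answers = {a: member(a) for a in queries}
    relevant = []
    for i in range(n):
        for a in queries:
            flipped = a[:i] + (1 - a[i],) + a[i + 1:]
            # Two assignments that differ only in variable i but get
            # different answers prove that variable i is relevant.
            if flipped in answers and answers[a] != answers[flipped]:
                relevant.append(i)
                break
    # Project every queried assignment onto the relevant variables.
    table = {}
    for a, value in answers.items():
        table[tuple(a[i] for i in relevant)] = value
    return relevant, table
```

For example, with `member = lambda a: a[1] ^ (a[3] & a[4])` on five variables, the sketch reports the relevant positions 1, 3 and 4 (0-indexed) and a truth table with eight rows.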

Next, we prove the following theorem to conclude that a decision tree of minimum
size can be output for the learned function $f \in \mathcal{F}_{n,r}$ in time polynomial in $2^r$.

Theorem 2. A decision tree of minimum size representing a function f can be 
constructed in polynomial time given a complete truth-table of f. 

Proof: Let $V$ be the set of $r$ variables over which $f$ is defined and let $T$ be a
truth-table of $f$ of $2^r$ entries.

We construct a table $P$ of size $3^r$, indexed by partial assignments on the $r$
variables. In position $\alpha$ of this array we shall place a decision tree of minimum
size which computes the function $f_\alpha$ obtained by projecting $f$ to the partial
assignment $\alpha$. That is,

\[ P[\alpha] = \text{a minimum size DT for } f_\alpha. \]

Note that once P is computed then a minimum size decision tree for / is obtained 
by reading off the entry P[A], where A is the empty partial assignment. 

For any tree $T$, let $|T|$ denote the size of $T$ and for any Boolean function $g$,
let $|g|$ denote the decision tree size of $g$. For any partial assignment $\alpha$ and for
any variable $v$ not assigned a value in $\alpha$, let $\alpha \cup \{v \leftarrow b\}$ denote the partial
assignment obtained by extending $\alpha$ with the value $b \in \{0, 1\}$ assigned to $v$. Now, $P$ may
be constructed by using a dynamic programming approach after observing these
facts:

1. For each (complete) assignment $\alpha$ over all $r$ variables, the minimum size
decision tree for $f_\alpha$ is just a constant 0 or 1 node obtained by looking up the
value of $f(\alpha)$ in the truth-table $T$.

2. For every partial assignment $\alpha$ over some set $X \subset V$ of fewer than $r$ variables,
if there exists a variable $v \in V - X$ such that $f_{\alpha \cup \{v \leftarrow 0\}} = f_{\alpha \cup \{v \leftarrow 1\}}$, then
$f_\alpha = f_{\alpha \cup \{v \leftarrow 0\}}$. Consequently, the entry $P[\alpha]$ may be filled by copying the
entry for $P[\alpha \cup \{v \leftarrow 0\}]$.

3. For every partial assignment $\alpha$ over some set $X \subset V$ of fewer than $r$ variables,
if there exists no variable $v \in V - X$ such that $f_{\alpha \cup \{v \leftarrow 0\}} = f_{\alpha \cup \{v \leftarrow 1\}}$, then

\[ |f_\alpha| = \min_{v \in V - X} \{ 1 + |f_{\alpha \cup \{v \leftarrow 0\}}| + |f_{\alpha \cup \{v \leftarrow 1\}}| \}. \]

Consequently, the entry $P[\alpha]$ may be found by first finding a variable $v$ for
which the minimum of the above equation is achieved and then constructing
a tree with $v$ as the root node and $P[\alpha \cup \{v \leftarrow 0\}]$ and $P[\alpha \cup \{v \leftarrow 1\}]$ as the
left and right children respectively of $v$.

Clearly, the dynamic programming approach to filling the entries of $P$ can be
accomplished in time polynomial in $2^r$. ■
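
A minimal Python sketch of this dynamic program follows. It is not the authors' code. As a simplifying assumption it merges facts 1 and 2 into a single base case (the projected function is constant, so a single leaf suffices) and otherwise minimises over all unassigned variables as in fact 3. The tuple encoding of trees and the name `min_decision_tree` are hypothetical.

```python
from functools import lru_cache
from itertools import product


def min_decision_tree(truth_table, r):
    """Return (size, tree) for a minimum-size decision tree.

    truth_table maps every tuple in {0,1}^r to 0 or 1. A tree is either a
    leaf label (0 or 1) or a triple (variable, left, right), where the left
    subtree is taken when the variable is 0. Size counts all nodes.
    """

    @lru_cache(maxsize=None)
    def best(alpha):
        # alpha is a partial assignment: entry i is 0, 1, or None (unset).
        free = [i for i, bit in enumerate(alpha) if bit is None]
        # Collect the values of f on all completions of alpha.
        values = set()
        for bits in product((0, 1), repeat=len(free)):
            full = list(alpha)
            for i, bit in zip(free, bits):
                full[i] = bit
            values.add(truth_table[tuple(full)])
        if len(values) == 1:
            # The projected function is constant: one leaf is optimal.
            return 1, values.pop()
        candidates = []
        for v in free:
            low = best(alpha[:v] + (0,) + alpha[v + 1:])
            high = best(alpha[:v] + (1,) + alpha[v + 1:])
            candidates.append((1 + low[0] + high[0], (v, low[1], high[1])))
        return min(candidates, key=lambda c: c[0])

    return best((None,) * r)
```

The memo table has at most $3^r$ entries, one per partial assignment, which matches the size of the table $P$ in the proof. The constant-function check is done by brute force here, which keeps the sketch short at the cost of extra work per entry.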

An immediate corollary of the above theorems is that Boolean functions of
O(log n) relevant variables can be exactly learned using non-adaptive membership
queries and a decision tree of minimum size for the learned function output
in time polynomial in n.

4 Learning with Equivalence Queries 

4.1 Approximate Fingerprints 

We recall the following definitions from [am. Let $C$ denote a class of concepts
defined over an instance space of $\Sigma^*$ (where $\Sigma$ is a finite alphabet), $\mathcal{R}$ a representation
class for $C$, $w$ a string from $\Sigma^*$, $b$ an element of $\{0, 1\}$, and $\alpha > 0$
a real number. We say that the pair $(w, b)$ is an $\alpha$-approximate fingerprint with
respect to $C$ if

\[ |\{ c \in C : \chi_c(w) = b \}| < \alpha |C|. \]

That is, the number of concepts in $C$ that agree with the classification $b$ of $w$ is
strictly less than the fraction $\alpha$ of the total number of concepts in $C$.
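
The definition can be checked by brute force on a toy class. The sketch below is an illustration only: it assumes the concept class consists of majority functions over all m-element subsets of n variables, and that ties are rejected. The helper names are hypothetical.

```python
from itertools import combinations


def majority_concepts(n, m):
    """All majority functions over m of n variables, given as index tuples."""
    return list(combinations(range(n), m))


def classify(subset, w):
    """Majority vote of the bits of w indexed by subset (ties reject)."""
    return int(2 * sum(w[i] for i in subset) > len(subset))


def is_approx_fingerprint(concepts, w, b, alpha):
    """True if fewer than alpha * |C| concepts classify w as b."""
    agree = sum(1 for c in concepts if classify(c, w) == b)
    return agree < alpha * len(concepts)
```

With n = 8, m = 3 and an assignment w that has exactly two 1's, only 6 of the 56 concepts accept w, so (w, 1) is an α-approximate fingerprint for α = 0.2 but not for α = 0.1.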

A sequence of concept classes is a sequence $C_1, C_2, C_3, \ldots$ such that each $C_i$
is a class of concepts. Such a sequence is polynomially bounded (with respect
to a given representation $\mathcal{R}$) if there exists a polynomial $p(n)$ such that for all
$c \in C_n$, there is a representation $r \in \mathcal{R}$ such that $c(r) = c$ and $|r| \le p(n)$.

A representation of concepts $\mathcal{R}$ is said to have the approximate fingerprint
property if there exists a polynomially bounded sequence of concept classes
$T_1, T_2, T_3, \ldots$, and a polynomial $p(n)$ such that for any polynomial $q(n)$, for infinitely
many $n$, $T_n$ contains at least two concepts and if $r \in \mathcal{R}$ and $|r| \le q(n)$
then there exists a string $w \in \Sigma^*$ of length at most $p(n)$ such that $(w, \chi_{c(r)}(w))$
is a $1/q(n)$-approximate fingerprint with respect to $T_n$. That is,

\[ |\{ c \in T_n : \chi_c(w) = \chi_{c(r)}(w) \}| < |T_n| / q(n), \]

where $c(r)$ is the concept associated to $r$.

Since we are dealing with Boolean concepts defined over $n$ variables, the polynomial
$p(n)$ that bounds the length of the words can be set to $n$. The notion of
approximate fingerprints was introduced as a means of proving non-learnability 
of certain classes in the equivalence query model using the following theorem. 

Theorem 3. 0

Let $\mathcal{R}$ be a representation class of concepts with the approximate fingerprint property.
Then there is no algorithm for exact identification of $\mathcal{R}$ using polynomially
many equivalence queries of polynomial size.



4.2 The Negative Results 

We now show that none of the classes considered here can be exactly learned 
with equivalence queries alone, even when the number of relevant variables is 
O(log n). Since we do not restrict the hypothesis space of equivalence queries
to only hypotheses of O(log n) relevant variables, our results imply that the
general classes (with potentially more relevant variables) are also not learnable 
with equivalence queries alone. Indeed, this is how the results are stated and 
proved below. 

Our results for all the classes considered here use the same central idea. We first 
illustrate the technique for the class DT of decision trees. 

In order to prove approximate fingerprints for DT, we need the following lemma. 

Lemma 1. Let $k > 0$ be some fixed constant. If $d$ is a decision tree of size at
most $n^k$, over $n$ variables, then there exists an assignment $a$ such that either

(a) $a$ contains at most $k \log n$ 1’s and $d(a) = 1$, or

(b) $a$ contains at most $k \log n$ 0’s and $d(a) = 0$.

Proof: Since each of the $2^n$ assignments reaches some leaf of $d$ after being
evaluated, and $d$ has at most $n^k$ leaves, $d$ must have a path $p$ of length at most
$\log(n^k) = k \log n$. The assignment $a$ can be constructed by satisfying $p$ first and
then filling the rest of the positions with 1’s or 0’s as necessary.
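
The construction in this proof can be sketched as follows. The code is an illustration under two assumptions that are not in the text: trees are encoded as nested tuples (variable, left, right) with leaves 0 and 1, and no variable is tested twice on the same path. The function names are hypothetical.

```python
def evaluate(tree, a):
    """Evaluate a decision tree on assignment a (branch right on 1)."""
    while tree not in (0, 1):
        var, left, right = tree
        tree = right if a[var] else left
    return tree


def shortest_path(tree, path=()):
    """Return (constraints, label) for a shallowest leaf of the tree."""
    if tree in (0, 1):
        return path, tree
    var, left, right = tree
    candidates = [
        shortest_path(left, path + ((var, 0),)),
        shortest_path(right, path + ((var, 1),)),
    ]
    return min(candidates, key=lambda c: len(c[0]))


def sparse_witness(tree, n):
    """Build an assignment that follows a shortest path and is padded so
    that it has few 1's if the leaf is 1, and few 0's if the leaf is 0."""
    constraints, label = shortest_path(tree)
    a = [1 - label] * n  # pad with 0's for a 1-leaf, with 1's for a 0-leaf
    for var, bit in constraints:
        a[var] = bit
    return tuple(a), label
```

The number of positions of the returned assignment that equal the leaf label is at most the length of the chosen path, which is the quantity the lemma bounds by $k \log n$.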

Theorem 4. The class DT has approximate fingerprints.

Proof: Let $T_n$ be the target class of functions that represent the majority of
$\log n$ variables chosen out of $n$ variables. Clearly, $|T_n| = \binom{n}{\log n}$.

Clearly, the DT-size of each function in $T_n$ is at most $n$. Therefore, it suffices
to show that for any fixed constant $k$, and for infinitely many $n$, and any DT $d$
of size at most $n^k$, the assignment $a$ whose existence is guaranteed by Lemma 1
satisfies the property that fewer than $|T_n| n^{-k}$ of the functions in $T_n$ classify $a$
the same way as $d$.

Without loss of generality, assume that $\log n$ is an even integer. (This avoids
floors and ceilings.) The number of functions in $T_n$ that accept (reject) an
assignment with $k \log n$ 1’s (respectively, 0’s) is

Using as an upper bound for this sum turns out to be less than



log n 




which is less than 

Now can be shown to be less than 

^fc(l l 2 |It)(iogn) 2 
(n - logn + 1 )^~, 

which is less than n~^ for n > 2 ”^^, where c is some constant. ■ 

We now show that the remaining classes considered here have approximate fin- 
gerprints. 

Corollary 1. The following classes have approximate fingerprints:

1. CDNF Formulas

2. Self-Dual DNF Formulas and Self-Dual CNF Formulas

3. log n-DNF ∩ log n-CNF

4. log n-depth BP

Proof: The proof follows easily if we first use the following propositions for each
of the above classes instead of Lemma 1.
Let $k > 0$ be some fixed constant.

- For every CDNF formula $h = (f, g)$ over $n$ variables, of size at most $n^k$,
there exists an assignment $a$ that either (a) contains at most $k \log(2n)$ 1’s
and $h(a) = 1$, or (b) contains at most $k \log(2n)$ 0’s and $h(a) = 0$.

To see this, note that the DNF formula $f + \neg g$ has at most $2 n^k$ terms and
is identically true, i.e., it accepts all $2^n$ assignments. Consequently, such a
formula must have a term $t$ that accepts at least $2^n / (2 n^k)$ assignments, whence
the length of $t$ is at most $k \log(2n)$. The assignment $a$ can now be constructed
by satisfying the literals in $t$ and then padding the remaining variables with
0’s or 1’s according to whether $t$ is a term in $f$ or corresponds to a clause in
$g$ respectively.

- For every self-dual DNF formula $f$ of at most $n^k$ terms over $n$ variables,
there exists an assignment $a$ that either (a) contains at most $k \log(2n)$ 1’s
and $f(a) = 1$, or (b) contains at most $k \log(2n)$ 0’s and $f(a) = 0$.

Let $\tilde{f}$ denote the DNF formula obtained from $f$ by complementing every
literal in every term of $f$. Then, by the definition of self-dual formulas, it
follows that $\tilde{f} = \neg f$. Therefore, $f + \tilde{f}$ is identically true and the existence
of $a$ is proved in an identical manner to that of CDNF formula $h$ above.
Note that an analogous statement holds for self-dual CNF formulas.



— k log n 

log n 



= r(n). 

- For every log n-DNF ∩ log n-CNF formula $f$ of size at most $n^k$, there exists
an assignment $a$ that either (a) contains at most $\log(n)$ 1’s and $f(a) = 1$,
or (b) contains at most $\log(n)$ 0’s and $f(a) = 0$.

This is obvious.

- For every log n-depth BP $h$ of size at most $n^k$ over $n$ variables, there exists
an assignment $a$ that either (a) contains at most $k \log(n)$ 1’s and $h(a) = 1$,
or (b) contains at most $k \log(n)$ 0’s and $h(a) = 0$.

This follows along the same lines as the proof of Lemma 1.

With this change, the proof of Theorem 4 can be used with very slight modifications
to show that all of these classes have approximate fingerprints. In
particular, the target class remains the same for all the classes, and each
function in $T_n$ has a representation of size at most $n$ in each of these classes. ■
A final note. Since the target class defined in Theorem 4 is monotone, the monotone
versions of all the classes mentioned above (including DT) also have approximate
fingerprints, and are therefore not learnable with equivalence queries
alone.

Abstract. We design asymptotically optimal query strategies for the
class of parity functions which contain at most k essential variables. 
The number of questions asked is at most twice the number asked by 
an optimal strategy. The strategy presented is even non-adaptive. For 
fixed k, the number of questions is optimal up to additive constants. Our 
results improve upon results by Uehara, Tsuchida and Wegener |^. 



1 Introduction 

One part of algorithmic learning theory is concerned with the problem of iden- 
tifying an object from a set of objects by querying a teacher. Usually, the main 
aim is to keep the number of questions as small as possible. Two other possible 
cost measures are given by the time that it takes the learner to decide which 
questions to ask and the time needed to deduce from the answers the object, 
but this aspect is sometimes neglected. 

In this paper, we consider a problem which is known in learning theory under the
name “attribute-efficient learning with k essential attributes”. A more precise
definition will be given shortly. We will show that parity-check matrices can be
used to obtain so-called non-adaptive and adaptive strategies for learning parity
functions.

The previously best known results for the class considered are by Uehara, Tsuchida
and Wegener jSj. We give a different approach to designing query strategies
and extend the positive results from adaptive to non-adaptive learning strategies.

A recent paper by Damaschke [2] is concerned with the learning of arbitrary
function classes and it gives a good survey of the known results. In that paper,
well-known combinatorial structures are exploited, like “r-universal sets”, and
a new structure is designed which is given the name “r-wise bipartite connected
families”. Since those structures are designed to cope with arbitrary classes, it
should not be a surprise that for special classes, other combinatorial objects will
be more efficient.

Let us now give a formal introduction to the considered problem. Let $B_n$ be the
class of all Boolean functions on $n$ variables.

A subset $\mathcal{F}$ of $B_n$ is considered to be a concept class. During the learning process,
we want to find out which function $f \in \mathcal{F}$ has been chosen by a teacher (or an
adversary). Our task is to identify the function chosen by the teacher by asking
questions; more precisely, our query consists in presenting an input $a \in \{0, 1\}^n$.
The teacher answers with the value $f(a)$. There are two possible models which
can be considered, namely, the adaptive and the non-adaptive model. In the
non-adaptive model, the learner has to make all queries at once, i.e., the queries
are not allowed to depend on previous answers of the teacher. In the adaptive
model, the teacher gives his answer after each query, and the learner is allowed
to let his queries depend on all previously given answers.

Given a concept class $\mathcal{F}$, the cost of a (deterministic) query strategy is the
maximum number of questions asked, given any function $f \in \mathcal{F}$, before the
learner has identified $f$. Since the answers of the teacher are $\{0, 1\}$-valued, a
standard information theoretic argument shows that $\lceil \log |\mathcal{F}| \rceil$ is a lower bound
for the number of questions asked. It is also clear that we can identify any chosen
function by asking at most $2^n$ questions. Our goal is to determine the number
of questions which is necessary and sufficient.



2 Basic Definitions 

Here, we review some of the basic notions and properties known in coding theory 
which we will apply later on. The interested reader may consult the books by 
Lidl and Niederreiter by Hoffman et al. and by Peterson and Weldon jS| 
for further details. 

In general, codes are defined over finite fields $F$, but since we are dealing with
the field $\mathbb{Z}_2 = \{0, 1\}$ only, we adapt the definitions to this special case.

For two 0-1-vectors $v = (v_1, \ldots, v_n)$ and $w = (w_1, \ldots, w_n)$, let $d(v, w)$ denote
the Hamming distance between the two, i.e., $d(v, w) = |\{ i \mid v_i \neq w_i \}|$.

Definition 1. Given a subset $C \subseteq \{0, 1\}^n$ with $|C| \ge 2$, we define the minimum
Hamming distance $d(C)$ as follows:

\[ d(C) = \min \{ d(c_1, c_2) \mid c_1 \neq c_2 \in C \}. \]

In our considerations, linear codes will prove to be useful.

Definition 2. Let $H$ be an $(n-k) \times n$-matrix of rank $n-k$ and with entries from
$\{0, 1\}$. The set

\[ C := \{ c \in \{0, 1\}^n \mid H \cdot c = 0 \} \]

is called a linear $(n, k)$-code. The matrix $H$ is referred to as the parity-check
matrix of the code $C$.

(Note that here and in the sequel, we do the matrix-vector multiplications over
the field $\mathbb{Z}_2$.)

If the code C which corresponds to a parity-check matrix H has Hamming 
distance d, then we will say that H has code distance d. 

The following equivalence is well-known (see e.g. Lemma 8.14 in the book by 
Lidl and Niederreiter g)): 

Proposition 1. A parity-check matrix $H$ has code distance at least $s+1$ if and only if
any $s$ columns of $H$ are linearly independent over the field $\mathbb{Z}_2$.
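
The equivalence can be checked by brute force on a small example. The sketch below is illustrative only. It uses a parity-check matrix of the [7,4] Hamming code as an assumed test case, and it relies on two standard facts: the minimum distance of a linear code equals the minimum weight of a non-zero codeword, and over $\mathbb{Z}_2$ a set of columns is linearly independent exactly when no non-empty subset of it sums to zero. The function names are hypothetical.

```python
from itertools import combinations, product


def code_distance(H):
    """Minimum weight of a non-zero c with H·c = 0 over Z_2 (brute force).

    Assumes the code contains at least one non-zero codeword.
    """
    n = len(H[0])
    weights = [
        sum(c)
        for c in product((0, 1), repeat=n)
        if any(c)
        and all(sum(h * x for h, x in zip(row, c)) % 2 == 0 for row in H)
    ]
    return min(weights)


def every_s_columns_independent(H, s):
    """True if no non-empty set of at most s columns sums to zero mod 2."""
    cols = list(zip(*H))
    for j in range(1, s + 1):
        for chosen in combinations(cols, j):
            if all(sum(bits) % 2 == 0 for bits in zip(*chosen)):
                return False
    return True
```

For the Hamming matrix with rows 1010101, 0110011 and 0001111, the code distance is 3, any two columns are independent, and some three columns are dependent.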

There exists a large number of constructions of linear codes with certain distances.
One of the few which makes a general statement is the so-called Gilbert-Varshamov
bound. It states the following (see e.g. Theorem 8.27):

Theorem 1. Gilbert-Varshamov bound: There exists a linear $(n, k)$-code
with minimum distance at least $d$ whenever

\[ 2^{n-k} > \sum_{i=0}^{d-2} \binom{n-1}{i}. \]


3 Learning Parity Functions with k Essential Variables 

We consider the following two concept classes. 

Definition 3. $PAR(k, n) \subseteq B_n$ is the set of all parity functions which depend
on exactly $k$ variables. $PAR_{\le}(k, n)$ is the set of all parity functions which depend
on at most $k$ variables. I.e.,

\[ PAR(k, n) = \Bigl\{ f_S \mid f_S = \bigoplus_{i \in S} x_i,\ S \subseteq \{1, \ldots, n\},\ |S| = k \Bigr\}, \]

\[ PAR_{\le}(k, n) = \Bigl\{ f_S \mid f_S = \bigoplus_{i \in S} x_i,\ S \subseteq \{1, \ldots, n\},\ |S| \le k \Bigr\}. \]

The variables that a function depends on are often called “essential” variables.
It should be clear that any query strategy for learning $PAR_{\le}(k, n)$ can be used
as a query strategy for $PAR(k, n)$, since the latter class is a subset of the former.
The class $PAR(k, n)$ contains exactly $\binom{n}{k}$ Boolean functions, and thus a lower
bound of $\lceil \log \binom{n}{k} \rceil$ queries holds. Both classes can be learned with at most $n$
queries, since the learner may ask the $n$ queries $e_1, \ldots, e_n$, where $e_i$ is the input
having a 1 in the $i$-th component and a 0 everywhere else. Due to symmetry,
the complexity of learning $PAR(k, n)$ is the same as the complexity of learning
$PAR(n-k, n)$. For $k > n/2$, the class $PAR_{\le}(k, n)$ contains more than $2^{n-1}$
functions, and thus $n$ queries are sufficient and necessary. We can restrict to
$k \le \lfloor n/2 \rfloor$ in the following.
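
The information-theoretic lower bounds in this paragraph are easy to tabulate. The following sketch is illustrative and the function name is hypothetical. It computes $\lceil \log_2 |\mathcal{F}| \rceil$ exactly with integer arithmetic for the two parity classes.

```python
from math import comb


def info_lower_bound(n, k, at_most=False):
    """ceil(log2 |F|) for F = PAR(k, n), or for the 'at most k' class."""
    if at_most:
        size = sum(comb(n, j) for j in range(k + 1))
    else:
        size = comb(n, k)
    # For size >= 1, (size - 1).bit_length() equals ceil(log2(size)).
    return (size - 1).bit_length()
```

For n = 10 and k = 6 the "at most k" class has 848 functions, more than $2^9$, so the bound already equals n = 10.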

By using binary search, the class $PAR(1, n)$ can obviously be learned with $\lceil \log n \rceil$
queries, which is optimal. We are interested in the query complexity for larger
$k$.

Assume that the teacher is keeping secret a parity function $f_S$ which the learner
wants to determine. It will be more convenient to view a query $a \in \{0, 1\}^n$ as the
set $A = \{ i \mid 1 \le i \le n, a_i = 1 \}$. A basic strategy applied in jHI is to determine
the elements $s \in S$ one by one. Let us make this more precise:

If the learner chooses some $a$ (resp. some set $A$) and asks for the value of $f_S(a)$,
then the answer determines whether $S \cap A$ contains an even or an odd number
of elements. This suggests the following “binary search procedure”:

Given a subset $A$ such that $|S \cap A|$ is odd, i.e., $f_S(a) = 1$, we can identify one
$s \in S$ with at most $\lceil \log |A| \rceil \le \lceil \log n \rceil$ queries. Namely, we partition $A = A_1 \cup A_2$
such that $0 \le |A_1| - |A_2| \le 1$. One of those two sets must contain an odd number
of elements from $S$, hence by asking one query we obtain a smaller set which
contains an odd number of elements from $S$. By repeating this procedure, we
obtain a set $A^*$ of cardinality 1, i.e., $A^* = \{s\}$ such that $|S \cap A^*|$ is odd, i.e.,
$s \in S$. This takes at most $\lceil \log n \rceil$ queries.
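
The binary search procedure can be sketched as follows. The code is an illustration, not the authors' implementation. The teacher is modelled by a hypothetical callback `ask(B)` that returns $|S \cap B| \bmod 2$, and the caller is assumed to supply a set A on which the answer is 1.

```python
def find_essential_variable(ask, A):
    """Return (s, queries) where s is an essential variable found in A.

    Precondition (not checked, to avoid an extra query): ask(A) == 1.
    """
    A = sorted(A)
    queries = 0
    while len(A) > 1:
        half = (len(A) + 1) // 2  # sizes of the two parts differ by at most 1
        first, second = A[:half], A[half:]
        queries += 1
        # Exactly one of the two parts contains an odd number of
        # essential variables, so one query decides which part to keep.
        A = first if ask(first) == 1 else second
    return A[0], queries
```

Each round halves the candidate set (rounding up), so the loop runs at most $\lceil \log |A| \rceil$ times.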

If $k$ is odd and if we are to learn $PAR(k, n)$ then we can start the binary search
procedure with $A = \{1, \ldots, n\}$. But how do we start the procedure if $k$ is even?
So far, we have described the same approach as in |^. There, different solutions
are proposed in order to obtain an $a$ such that $f_S(a) = 1$. We first show that
coding theory offers alternative solutions to this problem. Because of the binary
search phase, the approach from only yields adaptive learning strategies. We
show how to obtain non-adaptive learning strategies which even use the optimal
number of queries.



4 Finding an Input on Which the Teacher Says “Yes” 

We have seen that it may be crucial to find a “1-input”, i.e., one on which the secret
parity function evaluates to 1. We show how this can be accomplished with the
help of parity-check matrices.

We can interpret the rows of a matrix $H \in \{0, 1\}^{m \times n}$ in a canonical fashion as
describing $m$ subsets $A_1, \ldots, A_m$, each from the set $\{1, \ldots, n\}$.

A parity function $f_S$ can be seen as a column vector $v$ of length $n$ which has
$v_i = 1$ if $i \in S$ and $v_i = 0$ otherwise. (Thus, the Hamming distance between
parity functions can also be defined.)

If $v$ is the vector corresponding to the secret parity function, then the teacher's
answers on the inputs $A_1, \ldots, A_m$ are described by the product $H \cdot v$; more
precisely, $(H \cdot v)_i$ is the answer of the teacher on query $A_i$.

In order to find a 1-input for parities with $k$ essential variables, it is enough to
find a matrix $H$ with $n$ columns such that no sum of $k$ columns of $H$ is equal
to the zero vector. (Let us denote this property as the “$k$-sum-property”.) The
reason is that for such a matrix $H$, any answer $H \cdot v$ of the teacher will not be
zero, and thus reveal a 1-input $A_i$.

A parity-check matrix $H$ with code distance $k + 1$ (see Definition 3) fulfills the
$j$-sum-property for all $j = 1, \ldots, k$:

Assume that $H$ has $m$ rows and $n$ columns. By Proposition 1, any set of $j \le k$
columns is linearly independent over $\mathbb{Z}_2$; in particular, no sum of $1 \le j \le k$
columns is equal to zero.

We can convince ourselves easily that, because of this property, parity-check
matrices can be used in the same fashion in order to find a 1-input when the
secret function is from $\mathrm{PAR}_{\le}(k, n)$. If none of the inputs is a 1-input, then the
secret parity function is the empty parity function.



5 Designing Learning Strategies



Let us start by showing a bound for matrices which have the $k$-sum-property.
The proof is completely analogous to the proof of the Gilbert-Varshamov bound.
Note that the lemma only makes sense if $k$ is even, because otherwise we could
take the matrix which has a single row of ones only.



Lemma 1. Let $k \ge 2$. There is a matrix $H$ with at most $m \ge 1$ rows and $n$
columns which has the $k$-sum-property whenever the following holds:

$$2^m > \binom{n-1}{k-1}.$$



Proof. Choose any vector from $\{0, 1\}^m$ as the first column of $H$. Now assume
that $j \le n-1$ columns have already been chosen such that no $k$ columns of them
add up to zero. There are $\binom{j}{k-1}$ ways to choose $k-1$ of those columns, hence at
most $\binom{j}{k-1}$ columns are forbidden as column $j+1$. Since
$\binom{j}{k-1} \le \binom{n-1}{k-1} < 2^m$,
there is at least one vector left which we can choose as the $(j+1)$-th column.
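
The greedy argument in the proof can be turned into a small program. The following is a sketch under assumptions of our own: columns are encoded as $m$-bit integers so that adding columns over $\mathbb{Z}_2$ becomes XOR, and the function names are hypothetical. It is meant for small parameters only, since it enumerates all $(k-1)$-subsets of the columns chosen so far.

```python
from itertools import combinations
from math import comb


def greedy_k_sum_matrix(n, k, m):
    """Pick n columns (m-bit integers) such that no k of them XOR to zero."""
    if 2 ** m <= comb(n - 1, k - 1):
        raise ValueError("the condition 2^m > C(n-1, k-1) is not met")
    columns = []
    for _ in range(n):
        # A new column is forbidden if it equals the sum of k-1 earlier columns.
        forbidden = set()
        for chosen in combinations(columns, k - 1):
            total = 0
            for c in chosen:
                total ^= c
            forbidden.add(total)
        columns.append(next(v for v in range(2 ** m) if v not in forbidden))
    return columns


def has_k_sum_property(columns, k):
    """Brute-force check that no k columns (by position) XOR to zero."""
    for chosen in combinations(columns, k):
        total = 0
        for c in chosen:
            total ^= c
        if total == 0:
            return False
    return True
```

For example, `greedy_k_sum_matrix(8, 4, 6)` is admissible because $2^6 = 64 > \binom{7}{3} = 35$, and the result can be verified with `has_k_sum_property(columns, 4)`.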



As a consequence, for a secret parity function $f \in \mathrm{PAR}(k, n)$ ($k$ even), we can
find an $a$ with $f(a) = 1$ with $m$ queries, whenever $m > \log \binom{n-1}{k-1}$. Unfortunately,
this helps us only in finding one variable. After that, the task remains to identify
a parity function on a smaller number of variables.

It turns out that matrices which have the $j$-sum-property for all $j = 1, \ldots, k$
simultaneously are the solution to the problem. Parity-check matrices are such
matrices.

We first describe a procedure which also contains binary search phases:



Theorem 2. Let $A_1, \ldots, A_m$ be the subsets of $\{1, \ldots, n\}$ corresponding to the
rows of a parity-check matrix with code distance $k+1$. Then the class $\mathrm{PAR}_{\le}(k, n)$
can be learned adaptively with at most $m + k \cdot \lceil \log n \rceil$ queries.



Proof. Note that no $1 \le j \le k$ columns in the parity-check matrix add to the
zero column. First, we ask the teacher for the values of $f_S(A_1), \ldots, f_S(A_m)$. If
$S \ne \emptyset$, then for at least one $i$, we have that $|A_i \cap S|$ is odd and with $\lceil \log n \rceil$
more questions we can identify one variable $s \in S$ by binary search. It remains
to identify the parity function $f_{S \setminus \{s\}}$. The teacher still gives the answers for the
function $f_S$, but we can think of “replacing” him by a teacher for the function
$f_{S \setminus \{s\}}$. This can be achieved by replacing any answer $f_S(A)$ of his by
$f_S(A) \oplus f_{\{s\}}(A)$.

Now, the procedure can be iterated. In a more algorithmic notation, the query
strategy looks as follows and it is easy to see that the number of queries is at
most $m + k \lceil \log n \rceil$.

    Identified := ∅; stop := false;

    for j = 1, ..., m do answer_j := f_S(A_j);

    while not stop do

    begin

        find an i such that answer_i = 1;
        if no such i exists, then stop := true;

        otherwise:

            identify an s ∈ S \ Identified by binary search on A_i \ Identified;
            Identified := Identified ∪ {s};

            “replace the current teacher” by a teacher for f_{S \ Identified}:
            for j = 1, ..., m do answer_j := answer_j ⊕ f_{{s}}(A_j);

    end;

    return Identified;
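
A runnable counterpart of the strategy above might look as follows. It is a sketch, not the author's implementation: variables are numbered from 0, the teacher is an assumed callable returning $|S \cap A| \bmod 2$, and `HAMMING_ROWS` is an illustrative choice of rows (the binary representations of $1, \ldots, 7$ as columns, which gives code distance 3 and hence covers $k = 2$).

```python
def make_teacher(S):
    """Return a teacher for f_S together with the list of queries it answered."""
    asked = []

    def teacher(A):
        asked.append(set(A))
        return len(S & set(A)) % 2

    return teacher, asked


def learn_adaptive(rows, teacher):
    """Adaptive strategy: m initial queries, then one binary search per variable."""
    identified = set()
    answers = [teacher(A) for A in rows]
    while True:
        i = next((j for j, a in enumerate(answers) if a == 1), None)
        if i is None:
            return identified
        candidates = sorted(rows[i] - identified)
        while len(candidates) > 1:
            half = set(candidates[: len(candidates) // 2])
            # Answer of the "replaced" teacher for the parity on S minus identified.
            reply = teacher(half) ^ (len(half & identified) % 2)
            candidates = [c for c in candidates if (c in half) == (reply == 1)]
        s = candidates[0]
        identified.add(s)
        # Remove the contribution of s from the stored answers.
        answers = [a ^ int(s in A) for a, A in zip(answers, rows)]


# Rows of the 3 x 7 matrix whose column for variable i is the binary form of i + 1.
HAMMING_ROWS = [{i for i in range(7) if ((i + 1) >> r) & 1} for r in range(3)]
```

With `teacher, asked = make_teacher({2, 5})`, the call `learn_adaptive(HAMMING_ROWS, teacher)` returns `{2, 5}`, and `len(asked)` stays within the bound $m + k \lceil \log n \rceil = 3 + 2 \cdot 3$.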



In order to keep the number of queries small, we should use parity-check matrices
with few rows only. We will consider concrete constructions in Section 6.
Surprisingly, it will turn out that the above adaptive learning strategy is not
competitive enough and that we can design nearly optimal learning strategies
that are even non-adaptive. Let us turn to the construction of such non-adaptive
learning strategies.

We need the following lemma, in which we again identify parity functions $f_S$
with binary vectors of length $n$. It is implicitly equivalent to Proposition 1.

Lemma 2. Let $f$ be an arbitrary parity function. If $H$ is a $\{0, 1\}$-matrix in
which each set of $t$ columns is linearly independent, then the set
$C = \{g \mid H \cdot g = H \cdot f\}$ has Hamming distance at least $t + 1$.

Note that if $f$ is the secret parity function, then the set $C$ can be seen as the set
of parity functions consistent with the answers of the teacher.

Proof. Let $c_1 \ne c_2 \in C$. Then, $H \cdot (c_1 \oplus c_2) = (H \cdot c_1) \oplus (H \cdot c_2)$ is equal to the
zero vector. If $c_1$ and $c_2$ have Hamming distance less than $t + 1$, then $c_1 \oplus c_2$
contains at most $t$ ones. Thus, a set of at most $t$ column vectors of $H$ would be
linearly dependent, a contradiction.

We obtain the following application:

Theorem 3. Let $H$ be a matrix in which each set of $2k$ columns is linearly
independent. Let $A_1, \ldots, A_m$ be the sets corresponding to the rows of $H$. $\mathrm{PAR}_{\le}(k, n)$
can be learned non-adaptively by asking the $m$ queries $A_1, \ldots, A_m$.

Proof. Assume that there are two parity functions $f_S$ and $f_{S'}$ consistent with the
answers of the teacher. By Lemma 2 they must have Hamming distance at least
$2k+1$. This is not possible for parity functions on at most $k$ essential variables.

The correspondence between non-adaptive learning strategies and parity-check
matrices also holds in the other direction.

Lemma 3. A set $\{a_1, \ldots, a_m\}$ of vectors $a_i \in \{0, 1\}^n$ which is a non-adaptive
learning strategy for $\mathrm{PAR}_{\le}(k, n)$ can be turned into a parity-check matrix with
code distance $2k+1$ and at most $m$ rows.

Proof. We can assume w.l.o.g. that no proper subset of $\{a_1, \ldots, a_m\}$ is a non-adaptive
learning strategy for $\mathrm{PAR}_{\le}(k, n)$ as well. As a consequence, the $a_i$ are linearly
independent over $\mathbb{Z}_2$. Now consider the matrix where the $a_i$ are arranged as
rows.

It is enough to prove that the sum of any $j$ columns, $1 \le j \le 2k$, is not equal
to the zero vector. Assume the contrary, i.e., assume that (w.l.o.g.) the columns
$1, \ldots, j$ add up to the zero vector. If $j \le k$, then the parity function $f_\emptyset = 0$ and
the parity function $x_1 \oplus \cdots \oplus x_j$ both yield the same answers by the teacher,
hence they cannot be distinguished by the learning strategy. If $j > k$, then one
can consider the parity function $f_{\{1, \ldots, k\}}$ and the parity function $f_{\{k+1, \ldots, j\}}$ and
argue as before.

Lemma 3 allows us to transfer lower bounds known in coding theory to lower
bounds on the number of queries necessary in non-adaptive learning strategies.

6 Nearly-Optimal Non-Adaptive Strategies

In this section, we show how known constructions from coding theory can be
applied to obtain non-adaptive learning strategies which are nearly optimal in
the class of all (adaptive or non-adaptive) learning strategies.

We could use the Gilbert-Varshamov bound (Theorem 1) to obtain a parity-check
matrix $H$ with $m$ rows which has code distance $k+1$ if
$2^m > \sum_{i=0}^{k-1} \binom{n-1}{i}$.

If $k \le n/3$, the right hand side is smaller than $\binom{n}{k}$ and by Theorem 2 we would
obtain an adaptive learning strategy with at most $\lceil \log \binom{n}{k} \rceil + k \lceil \log n \rceil$ queries,
but the strategy used in the following theorem leads to a better bound:

Theorem 4. If $k \le n/2$, then $\mathrm{PAR}_{\le}(k, n)$ can be learned non-adaptively with
at most $2 \cdot \log \binom{n}{k} + 1$ queries.

The lower bound of $\lceil \log \binom{n}{k} \rceil$ already holds for the function class $\mathrm{PAR}(k, n)$ and
all (adaptive or non-adaptive) strategies. Thus, the query number in Theorem 4
is asymptotically optimal.

Proof. We use the Gilbert-Varshamov construction to obtain a parity-check
matrix with $m$ rows and code distance $2k+1$, which exists if

$$2^m > \sum_{i=0}^{2k-1} \binom{n-1}{i} =: S_1,$$

i.e., $m = 1 + \lfloor \log S_1 \rfloor$ is enough. Observe that (for $k \le n/2$) $S_1 \le \binom{n}{k}^2$: We
have $S_1 = \binom{n}{2k-1} + \binom{n}{2k-3} + \cdots + \binom{n}{1}$, which can be interpreted
as the number of choices to pick a subset of odd cardinality at most $2k-1$ from
an $n$-element set. Every such subset can also be obtained by first choosing a
$k$-element subset $A$ of $\{1, \ldots, n\}$, then choosing a $(k-1)$-element subset $B$ of
$\{1, \ldots, n\}$ and taking the symmetric difference of both. Thus
$S_1 \le \binom{n}{k} \binom{n}{k-1} \le \binom{n}{k}^2$ and
$m \le 1 + \lfloor \log S_1 \rfloor \le 1 + \lfloor 2 \log \binom{n}{k} \rfloor$, and by Theorem 3 this number of queries is
enough.

The strategy from Theorem 4 improves the results from [?] and it also has the
advantage of being non-adaptive, in contrast to the strategies given in [?].

Perhaps the more important aspect here is that by using parity-check matrices,
we have made applicable other constructions from coding theory as well.

Until now, we have neglected the efficiency with which these parity-check
matrices can be constructed. Codes that can efficiently be constructed are the
so-called BCH-codes. The following theorem shows how they can be applied to
obtain very efficient non-adaptive strategies.

Theorem 5. Let $n = 2^w - 1$ for some $w \ge 2$. There is a non-adaptive learning
strategy for learning $\mathrm{PAR}_{\le}(k, n)$ with at most $k \log(n + 1)$ queries. The set of
queries can be constructed in polynomial time.

Proof. Choose some $2t + 1 \le n$. Then it is known (see e.g. the paper by Alon,
Babai and Itai [?], Proposition 6.5) that we can construct a matrix $H$ which is
the parity-check matrix of a BCH code with minimum distance $2t+2$ as follows.¹
Let $x_1, \ldots, x_n$ be $n$ column vectors of length $w$ over $\mathbb{Z}_2$ which binary represent
the $n$ nonzero elements of the Galois field with $2^w$ elements. Then the matrix $H$
has $wt$ rows and $n$ columns and is constructed as follows:



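$$H = \begin{pmatrix}
x_1 & x_2 & x_3 & \cdots & x_n \\
x_1^3 & x_2^3 & x_3^3 & \cdots & x_n^3 \\
\vdots & \vdots & \vdots & & \vdots \\
x_1^{2t-1} & x_2^{2t-1} & x_3^{2t-1} & \cdots & x_n^{2t-1}
\end{pmatrix}$$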



Since $w = \log(n + 1)$, and since we are looking for codes with Hamming distance
at least $2k+1$, we choose $t = k$ and obtain a matrix with $k \log(n + 1)$ rows. Thus,
the corresponding non-adaptive learning strategy uses $k \log(n + 1)$ queries.
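
To make the construction concrete, the following sketch builds the columns of $H$ for $w = 4$ and $t = k = 2$ and recovers a secret parity from the non-adaptive answers. Several choices here are assumptions for illustration and not taken from the paper: field elements are encoded as integers, the field $GF(2^4)$ is realised modulo the polynomial $x^4 + x + 1$ (written `0b10011`), each column of $H$ is packed into one integer, and decoding is done by brute force over all candidate sets, which is only feasible for small $n$ and $k$.

```python
from itertools import combinations


def gf_mul(a, b, w, poly):
    """Multiply a and b in GF(2^w), represented modulo the polynomial poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> w:          # degree w reached: reduce modulo poly
            a ^= poly
    return result


def gf_pow(a, e, w, poly):
    """Compute a^e in GF(2^w) by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a, w, poly)
    return result


def bch_columns(w, t, poly):
    """Column for x stacks x, x^3, ..., x^(2t-1) into one (w*t)-bit integer."""
    columns = []
    for x in range(1, 2 ** w):
        col = 0
        for r in range(t):
            col = (col << w) | gf_pow(x, 2 * r + 1, w, poly)
        columns.append(col)
    return columns


def syndrome(columns, S):
    """All teacher answers H * v at once: XOR of the columns indexed by S."""
    total = 0
    for i in S:
        total ^= columns[i]
    return total


def decode(columns, answers, k):
    """Brute-force search for the set of at most k variables matching the answers."""
    for size in range(k + 1):
        for cand in combinations(range(len(columns)), size):
            if syndrome(columns, cand) == answers:
                return set(cand)
    return None
```

With `cols = bch_columns(4, 2, 0b10011)` there are $n = 15$ columns of $wt = 8$ bits each, so 8 non-adaptive queries suffice, and `decode(cols, syndrome(cols, S), 2)` returns `S` for every set `S` of at most two variables.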



If $n$ is not of the form that fits Theorem 5, we can consider the smallest $N \ge n$
which is of the appropriate form, i.e., some $N \le 2n-1$. Since $\mathrm{PAR}_{\le}(k, n)$ can
be seen as a subset of $\mathrm{PAR}_{\le}(k, N)$, we obtain the following corollary:



Corollary 1. For every $n \ge 3$, there is a non-adaptive learning strategy for
learning $\mathrm{PAR}_{\le}(k, n)$ with at most $k \log n + k$ queries. The set of queries can be
constructed in polynomial time.



¹ If $2t + 2 = n + 1$, then the code contains only one codeword.

Since the lower bound $\log \binom{n}{k}$ is at least $k \log \frac{n}{k} = k \log n - k \log k$, the strategy
used in the corollary is, for fixed $k$, optimal up to a constant additive term.

The query strategies described in [?] are not constructive since they do not
compute the query set in polynomial time. In that paper, a constructive query 
strategy based on “splitters” - as suggested by the present author - is mentioned 
which has a query number of 0{k‘^logn). Hence the above bound improves on 
this query number and, in addition, our strategy has the advantage of being 
non-adaptive. 



7 Fine-Tuning 



We give a sketch of how the number of queries could be reduced further in special
cases.

Assume that we are given a matrix $H$ with $n$ columns in which each $2k-2$
columns are linearly independent. Ask the teacher for the values of the secret
parity function on the rows $H_i$. By Lemma 2 the set of parity functions consistent
with the answers has Hamming distance at least $2k-1$. As a consequence,
if $f_{S_1}$ and $f_{S_2}$, both from $\mathrm{PAR}(k, n)$, are consistent with the answers, then
$S_1 \cap S_2 = \emptyset$, since their Hamming distance $2(k - |S_1 \cap S_2|)$ is even and hence at
least $2k$. We are thus left with at most $\lfloor n/k \rfloor$ candidate functions from
$\mathrm{PAR}(k, n)$ and they have disjoint variable sets. We can determine which of those
parity functions the teacher has in mind by at most $\lceil \log \lfloor n/k \rfloor \rceil$ queries. The
reason for this is that, as in the case of $\mathrm{PAR}(1, n)$, we can use binary search.

As an application, we show that the above leads to a query strategy which for
the class $\mathrm{PAR}(2, n)$ needs at most one query more than the trivial lower bound
$\lceil \log \binom{n}{2} \rceil$:

Theorem 6. $\mathrm{PAR}(2, n)$ can be learned adaptively with at most

$$\lceil \log n \rceil + \lceil \log \lfloor n/2 \rfloor \rceil \le \left\lceil \log \binom{n}{2} \right\rceil + 1$$

queries.



Proof. We represent the numbers $0, \ldots, n-1$ as binary vectors of length $\lceil \log n \rceil$.
Then we choose $H$ to be the $\lceil \log n \rceil \times n$ matrix which contains these bit vectors
as columns. Matrix $H$ has the 2-sum-property. Our strategy queries the teacher
for his answers on the rows of $H$.

Since two parities from $\mathrm{PAR}(2, n)$ have Hamming distance either 2 or 4, the set
of parity functions from $\mathrm{PAR}(2, n)$ consistent with the answers of the teacher has
Hamming distance 4, i.e., they are all on disjoint variable sets. By the remarks
before the theorem, we can determine by $\lceil \log \lfloor n/2 \rfloor \rceil$ additional queries which of
the remaining candidate parity functions is the secret one. The inequality stated
in the theorem follows from standard calculations.
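
The strategy from the proof can be sketched as follows. The encoding is an assumption made for illustration: variables are numbered $0, \ldots, n-1$, the teacher is a callable returning $|S \cap A| \bmod 2$, and the helper names are hypothetical. The answers on the rows of $H$ yield the bitwise XOR $d$ of the two secret indices, the candidates are the disjoint pairs $\{i, i \oplus d\}$, and a binary search over these pairs queries one representative per pair.

```python
def ceil_log2(x):
    """Ceiling of log2(x) for an integer x >= 1."""
    return (x - 1).bit_length()


def make_teacher(S):
    """Return a teacher for f_S together with the list of queries it answered."""
    asked = []

    def teacher(A):
        asked.append(set(A))
        return len(S & set(A)) % 2

    return teacher, asked


def learn_par2(n, teacher):
    """Learn a parity on exactly two of the variables 0, ..., n-1."""
    d = 0
    for r in range(ceil_log2(n)):
        row = {i for i in range(n) if (i >> r) & 1}   # row r of H as a query set
        d |= teacher(row) << r
    # Candidate pairs {i, i XOR d}; they are pairwise disjoint.
    pairs = sorted({tuple(sorted((i, i ^ d))) for i in range(n) if (i ^ d) < n})
    while len(pairs) > 1:
        half = pairs[: len(pairs) // 2]
        # One element per pair in `half`: the answer is 1 iff the secret pair is there.
        if teacher({p[0] for p in half}) == 1:
            pairs = half
        else:
            pairs = pairs[len(half):]
    return set(pairs[0])
```

Since there are at most $\lfloor n/2 \rfloor$ candidate pairs, the second phase needs at most $\lceil \log \lfloor n/2 \rfloor \rceil$ queries, matching the count in the theorem.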

Another way to see the bound is to observe that in the matrix $H$, every row
contains at most $\lfloor n/2 \rfloor$ many ones. Hence, a binary search on the corresponding
row reveals one variable and the second is also determined by the first one.



8 Remarks

We have neglected the question of how efficiently the learner can determine the
secret parity function $f_S$ from the answers of the teacher. It should be clear
that due to the close connection to coding theory, this problem can be solved
with the usual techniques of error-correction, but we refrain from making these
connections more explicit since it was our main aim to demonstrate the usefulness
of parity-check matrices for optimizing the query number.



Abstract. We study the learnability of first order Horn expressions
from equivalence and membership queries. We show that the class of
range restricted Horn expressions, where every term in the consequent
of every clause appears also in the antecedent of the clause, is learnable.
The result holds both for the model where interpretations are examples
(learning from interpretations) and the model where clauses are examples
(learning from entailment).

The paper utilises a previous result on learning function free Horn
expressions. This is done by using techniques for flattening and unflattening
of examples and clauses, and a procedure for model finding for range
restricted expressions. This procedure can also be used to solve the
implication problem for this class.



1 Introduction 



We study the problem of exactly identifying universally quantified first order
Horn expressions using Angluin's [Ang88] model of exact learning. Much of
the work in learning theory has dealt with learning of Boolean expressions
in propositional logic. Early treatments of relational expressions were given
by [Val85, Hau89], but only recently more attention was given to the subject
in the framework of Inductive Logic Programming [MDR94, Coh95a, Coh95b]. It is
clear that the relational learning problem is harder than the propositional one
and indeed except for very restricted cases it is computationally hard [Coh95b].
To tackle this issue in the propositional domain various queries and oracles
that allow for efficient learning have been studied [Val84, Ang88]. In particular,
propositional Horn expressions are known to be learnable in polynomial time
from equivalence and membership queries. In the relational domain, queries have
been used in several systems [Sha83, SB86, DRB92, MB92] and
results on learnability in the limit were derived [Sha91, DRB92]. More recently
progress has been made on the problem of learning first order Horn expressions
from equivalence and membership queries using additional constraints or other
additional queries [Ari97, RT97, Kha98, RT98, RS98].

* This work was partly supported by EPSRC Grant GR/M21409.

In particular [Kha98] shows that universally quantified function free Horn
expressions are exactly learnable in several models of learning from equivalence and
membership queries. This paper extends these results to a class of expressions
allowing the use of function symbols. In particular, we present algorithms for
learning range restricted Horn expressions where every term in the consequent
of a clause appears also (possibly as a subterm) in the antecedent of the clause.
In fact, our results hold for a more expressive class, weakly range restricted Horn
expressions, that allows for some use of equalities in the antecedent of a clause.
Several kinds of examples have been considered in the context of learning first
order expressions. The natural generalisation of the setup studied in propositional
logic suggests that examples are interpretations of the underlying language.
That is, a positive example is a model of the expression being learned. Another
view suggests that a positive example is a sentence that is logically implied
by the expression, and in particular Horn clauses have been used as examples.
These two views have been called learning from interpretations and learning from
entailment respectively and were both studied before. We present algorithms
for learning weakly range restricted Horn expressions in both settings.
We also show that the implication problem for such expressions is decidable,
and provide an upper bound for its complexity. This motivates the use of this
class since learned expressions can be used as a knowledge base in a system in a
useful way.

The result for learning from interpretations is derived by exhibiting a reduction
to the function free case, essentially using flattening - replacing function
symbols with predicate symbols of arity larger by one [Rou92]. The reduction
uses flattening and unflattening of examples and clauses, and an axiomatisation
of the functionality of the new predicates. Learning from entailment is then
shown possible by reducing it to learning from interpretations under the given
restrictions. This relies on a procedure for model finding for this class, which
also proves the decidability of inference for it. We also derive learnability results
for range restricted expressions as corollaries. Interestingly, despite the use of
reduction, for learning from entailment we can use range restricted expressions
as the hypothesis language, but for learning from interpretations hypotheses are
weakly range restricted.

The rest of the paper is organised as follows. The next section provides
preliminary definitions. Section 3 presents range restricted expressions and some
of their basic properties. Sections 4 and 5 develop the results on learning from
interpretations and learning from entailment respectively, and Section 6 discusses
the implication problem. The concluding section discusses related work and
directions for future work.



2 Preliminaries



2.1 First Order Horn Expressions 



We follow standard definitions of first order expressions; for these see
[Llo87]. The learning problems under consideration assume a pre-fixed known
and finite signature $S$ of the language. That is, $S = (P, F)$ where $P$ is a finite set
of predicates, and $F$ is a finite set of function symbols, each with its associated
fixed arity. Constants are simply 0-ary function symbols and are treated as such.
In addition a set of variables $x_1, x_2, x_3, \ldots$ is used to construct expressions.

We next define terms and their depth. A variable is a term of depth 0. A constant
is a term of depth 0. If $t_1, \ldots, t_n$ are terms, each of depth at most $i$ (and one with
depth precisely $i$) and $f \in F$ is a function symbol of arity $n$, then $f(t_1, \ldots, t_n)$
is a term of depth $i + 1$.
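
The definition of depth translates directly into a recursive function. The term encoding below is an assumption made for illustration: a variable is a Python string, and a function application is a tuple whose first entry is the function symbol, so a constant is a tuple with no arguments.

```python
def depth(term):
    """Depth of a term; variables and constants have depth 0."""
    if isinstance(term, str):       # a variable
        return 0
    args = term[1:]
    if not args:                    # a constant, i.e. a 0-ary function symbol
        return 0
    return 1 + max(depth(arg) for arg in args)
```

For instance, the term $f_1(f_2(x), c)$ is written `("f1", ("f2", "x"), ("c",))` and has depth 2.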

An atom is an expression $p(t_1, \ldots, t_n)$ where $p \in P$ is a predicate symbol of
arity $n$ and $t_1, \ldots, t_n$ are terms. An atom is called a positive literal; a negative
literal is an expression $\neg l$ where $l$ is a positive literal. A clause is a disjunction of
literals where all variables are taken to be universally quantified. A Horn clause
has at most one positive literal and an arbitrary number of negative literals.
A Horn clause $\neg p_1 \vee \ldots \vee \neg p_n \vee p_{n+1}$ is equivalent to its “implicational form”
$p_1 \wedge \ldots \wedge p_n \rightarrow p_{n+1}$. When presenting a clause in this way we call $p_1 \wedge \ldots \wedge p_n$ the
antecedent of the clause and $p_{n+1}$ the consequent of the clause. A Horn expression
is a conjunction of Horn clauses.

The truth value of first order expressions is defined relative to an interpretation
$I$ of the predicates and function symbols in $S$ [Llo87]. Interpretations are also
called structures in model theory [CK90] and we use these terms interchangeably.
An interpretation $I$ includes a domain $D$ which is a (finite) set of elements. For
each function symbol $f \in F$ of arity $n$, $I$ associates a mapping from $D^n$ to $D$; if
$f(a_1, \ldots, a_n)$ is associated with $a$ we say that $f(a_1, \ldots, a_n)$ corresponds to $a$ in
$I$. For each predicate symbol $p \in P$ of arity $n$, $I$ specifies the truth value of $p$ on
$n$-tuples over $D$. The extension of a predicate in $I$ is the set of positive
instantiations of it that are true in $I$. In structural domains [Hau89, Kha96, RTR96],
domain elements are objects in the world and an instantiation describes
properties and relations of objects. We therefore refer to domain elements as objects.
Let $\mathit{str}(S)$ be the set of structures (interpretations) for the signature $S$.

The truth value of an expression in an interpretation $I$ is defined in a standard
way [CK90, Llo87]. Note that a Horn clause is not true (falsified) in $I$ iff there is a
variable assignment (a substitution) that simultaneously satisfies the antecedent
and falsifies the consequent. The terms (1) $T$ is true in $I$, (3) $I$ satisfies $T$, (4)
$I$ is a model of $T$, and (5) $I \models T$, have the same meaning. Let $T_1, T_2 \in \mathcal{H}(S,=)$;
then $T_1$ implies $T_2$, denoted $T_1 \models T_2$, if every model of $T_1$ is also a model of $T_2$.



2.2 The Learning Model 

We define here the scheme of learning from interpretations [DRD94]. Learning
from entailment [FP93], where examples are clauses in the language, is defined in
Section 5. An example is an interpretation; an example $I$ is positive for a target
expression $T$ if $I \models T$ and negative otherwise. Examples of this form have been
used before by various authors including [Hau89, DRD94, RT97, Kha98].

We use Angluin's model of learning from Equivalence Queries (EQ) and Membership
Queries (MQ) [Ang88]. Let $\mathcal{H}$ be a class under consideration, $\mathcal{H}'$ a (possibly
different) class used to represent hypotheses, and let $T \in \mathcal{H}$ be the target
expression. For membership queries, the learner presents an interpretation $I$ and the
oracle MQ returns “yes” iff $I \models T$. For equivalence queries, the learner presents
a hypothesis $H \in \mathcal{H}'$ and the oracle EQ returns “yes” if for all $I$, $I \models T$ iff
$I \models H$; otherwise it returns a counter example $I$ such that $I \models T$ and $I \not\models H$ (a
positive counter example) or $I \not\models T$ and $I \models H$ (a negative counter example).
In the learning model, $T \in \mathcal{H}$ is fixed by an adversary and hidden from the
learner. The learner has access to EQ and MQ and must find an expression
$H$ equivalent to $T$ (under the definition above). If there is an algorithm that
performs this task we say that $\mathcal{H}$ is learnable with hypothesis in $\mathcal{H}'$, or, when
$\mathcal{H}' = \mathcal{H}$, just that $\mathcal{H}$ is learnable. For complexity we measure the running time of the
algorithm and the number of times it makes queries to EQ and MQ. It is known
[Lit88, Ang88] that learnability in this model implies pac-learnability [Val84].

3 Range Restricted Horn Expressions 

Definition 1. (definite clauses) A clause is definite if it includes precisely
one positive literal. For a signature $S$, let $\mathcal{H}(S)$ be the set of Horn expressions
over $S$ in which all clauses are definite.

Definition 2. (range restricted clauses) A definite Horn clause is called
range restricted if every term that appears in its consequent also appears in its
antecedent, possibly as a subterm of another term. For a signature $S$, let $\mathcal{H}_r(S)$
be the set of Horn expressions over $S$ in which all clauses are definite and range
restricted.

For example, the clause $(p_1(f_1(f_2(x)), f_3()) \rightarrow p_2(f_2(x), x))$ is range restricted,
but the clause $(p_1(f_1(f_2(x)), f_3()) \rightarrow p_2(f_1(x), x))$ is not. We also consider
clauses with a limited use of equality in their antecedent.

Definition 3. (equational form) A definite clause $C$ with equalities in its
antecedent and where every non-equational literal includes only variables as terms,
and every equational literal is of the form $(x_{i_{n+1}} = f(x_{i_1}, \ldots, x_{i_n}))$ where $f \in F$
and $x_{i_j}$ are variables is in equational form. For a signature $S$, let $\mathcal{H}(S,=)$ be the
set of Horn expressions over $S$ in which all clauses are definite and in equational
form.

¹ A similar restriction has been used before by several authors. Unfortunately, in a
previous version of [Kha98] it was called “non-generative” while in other work it
was called “generative” [MF92]. The term “range-restricted” was used in database
literature for the function free case [Min88]. Here we use a natural generalisation for
the case with function symbols.

Every range restricted clause can be transformed into an equational form by
unfolding terms bottom-up and replacing them with variables. Formally, for
$T \in \mathcal{H}_r(S)$, transform each clause $C$ in $T$ as follows. Find a term $f(x_1, \ldots, x_n)$
in $C$ all of whose sub-terms are variables (this includes constants) and rewrite
the clause by replacing all occurrences of this term with a new variable $z$, and
adding a new literal $(z = f(x_1, \ldots, x_n))$ to the antecedent of $C$. For example the
clause

$p_1(x_1, f_1(x_2)) \wedge p_2(f_2()) \rightarrow p_1(x_1, f_2())$

is first transformed (using $f_1(x_2)$) into

$(z_1 = f_1(x_2)) \wedge p_1(x_1, z_1) \wedge p_2(f_2()) \rightarrow p_1(x_1, f_2())$

and then (using the constant $f_2()$) into

$((z_1 = f_1(x_2)) \wedge (z_2 = f_2()) \wedge p_1(x_1, z_1) \wedge p_2(z_2) \rightarrow p_1(x_1, z_2))$.
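
The unfolding step can be sketched in a few lines. The encoding is assumed for illustration and is not part of the paper: variables are strings, a term $f(t_1, \ldots, t_n)$ is the tuple `("f", t1, ..., tn)`, an atom is a pair `(predicate, argument_tuple)`, and fresh variables are named `z1`, `z2`, ... under the assumption that these names do not already occur in the clause.

```python
def find_flat_term(term):
    """Return a subterm f(x1, ..., xn) whose arguments are all variables, or None."""
    if isinstance(term, str):
        return None
    for arg in term[1:]:
        found = find_flat_term(arg)
        if found is not None:
            return found
    return term     # no compound argument, so the term itself is flat


def first_flat_term(atoms):
    """Return the first flat term occurring in the given atoms, or None."""
    for _pred, args in atoms:
        for term in args:
            found = find_flat_term(term)
            if found is not None:
                return found
    return None


def replace(term, target, var):
    """Replace every occurrence of target inside term by the variable var."""
    if term == target:
        return var
    if isinstance(term, str):
        return term
    return (term[0],) + tuple(replace(a, target, var) for a in term[1:])


def to_equational_form(antecedent, consequent):
    """Return (equations, antecedent_atoms, consequent_atom) without function terms."""
    atoms = list(antecedent) + [consequent]
    equations = []
    while True:
        target = first_flat_term(atoms)
        if target is None:
            return equations, atoms[:-1], atoms[-1]
        z = f"z{len(equations) + 1}"
        equations.append((z, target))            # stands for the literal (z = target)
        atoms = [(p, tuple(replace(t, target, z) for t in args)) for p, args in atoms]
```

Applied to the example clause, with antecedent `[("p1", ("x1", ("f1", "x2"))), ("p2", (("f2",),))]` and consequent `("p1", ("x1", ("f2",)))`, the function first introduces `z1` for $f_1(x_2)$ and then `z2` for the constant $f_2()$, mirroring the two steps shown above.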

In the equational form of range restricted clauses each new variable $z_i$ has one
defining equation, and we may think of the variables involved in the equation as
its ancestors. For example, $x_2$ is an ancestor of $z_1$ in the example above.
Constructed in this way all variables that appear in equational literals are ancestors
of some variable in the original literals. Since the clause is range restricted this
holds for variables in the consequent as well. We will consider the case where $z_i$
may have more than one such equation as in

$((z_1 = f_3(x_1)) \wedge (z_1 = f_1(x_2)) \wedge (z_2 = f_2()) \wedge p_1(x_1, z_1) \wedge p_2(z_2) \rightarrow p_1(x_1, z_2))$

but where the variables in equations are still ancestors in this sense.

Definition 4. (root variables, legal ancestor) Let $C$ be a definite Horn
clause in equational form. (1) The variables appearing in non-equational literals
in the antecedent are called root variables. (2) Root variables are legal ancestors.
(3) If an equational literal $(z = f(x_1, \ldots, x_n))$ appears in the antecedent and $z$
is a legal ancestor then $x_1, \ldots, x_n$ are also legal ancestors.

Definition 5. (weakly range restricted clauses) A definite Horn clause in
equational form is called weakly range restricted if every variable that appears in
its consequent or in equational literals is a legal ancestor. For a signature $S$, let
$\mathcal{H}_r(S,=)$ be the set of Horn expressions over $S$ in which all clauses are definite
and weakly range restricted.
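
Definitions 4 and 5 describe a closure computation, which can be sketched as follows. The clause encoding is an assumption for illustration: an equational literal $(z = f(x_1, \ldots, x_n))$ is the triple `(z, f, (x1, ..., xn))` and an atom is a pair `(predicate, variable_tuple)`.

```python
def legal_ancestors(equations, antecedent):
    """Compute the legal ancestors of a clause in equational form."""
    # Root variables: variables of the non-equational antecedent literals.
    legal = {v for _pred, args in antecedent for v in args}
    changed = True
    while changed:
        changed = False
        for z, _f, args in equations:
            # If z is a legal ancestor, so are the variables defining it.
            if z in legal and not set(args) <= legal:
                legal |= set(args)
                changed = True
    return legal


def is_weakly_range_restricted(equations, antecedent, consequent):
    """Check that all variables in the consequent and in equations are legal."""
    legal = legal_ancestors(equations, antecedent)
    used = set(consequent[1])
    for z, _f, args in equations:
        used |= {z, *args}
    return used <= legal
```

For the clause with two defining equations for $z_1$ shown above, the root variables are $x_1, z_1, z_2$, the closure adds $x_2$, and the clause is weakly range restricted.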

The following proposition (proof omitted) shows that we can replace range
restricted expressions with their equational form.

Proposition 1. Let $T \in \mathcal{H}_r(S)$ and let $T'$ be the equational form of $T$ computed
as above; then for all $I \in \mathit{str}(S)$, $I \models T$ if and only if $I \models T'$.

We next define legal objects in interpretations to play a role similar to legal
ancestors in clauses.







Roni Khardon 



Definition 6. (legal objects) Let $I \in \mathit{str}(S)$, and let $D$ be the domain of $I$.

(1) If $p(a_1, \ldots, a_n)$ is true in $I$ for $p \in P$ then $a_1, \ldots, a_n$ are legal objects.

(2) If $f(a_1, \ldots, a_n)$ is mapped to $a_{n+1}$ in $I$ where $f \in F$ and $a_{n+1}$ is a legal
object then $a_1, \ldots, a_n$ are legal objects.

The main property of weakly range restricted expressions used in our
constructions is the fact that their truth value is not affected by non-legal objects.



Lemma 1. Let $I \in \mathit{str}(S)$, let $C$ be a weakly range restricted clause, and let $\theta$
be a mapping of the variables of $C$ into objects of $I$. If $I \not\models C\theta$ then all objects
mapped by $\theta$ are legal objects.

Proof. Assume that $\theta$ maps a variable of $C$ to a non-legal object. If the equational
literals in the antecedent of $C$ are not satisfied in $I$ by $\theta$, then $I \models C\theta$.

Notice that if a variable $x$ is mapped to a non-legal object in $I$ and
$(z = f(\ldots, x, \ldots))$ is true in $I$ then $z$ is also mapped to a non-legal object. Now,
since every variable is an ancestor of some root variable in $C$, if the equational
literals of $C$ are satisfied in $I$ by $\theta$, then some root variable is mapped to a
non-legal object by $\theta$. By definition this implies that an atom $p(\ldots)$ in the antecedent
of $C$ is made false by $\theta$. Therefore $I \models C\theta$. ∎

4 Learning from Interpretations 

We first define a modified signature of the language above. Similar
transformations have been previously used under the name of flattening (see
also [NCDW97]). For each function symbol $f$ of arity $n$, define a new
predicate symbol $f_p$ of arity $n + 1$. Let $F_p$ be the new set of predicates so defined,
$S' = (P \cup F_p, \emptyset)$ be the modified signature, $\mathcal{H}_r(S')$ the set of range restricted
(and function free) Horn expressions over the predicates in $P \cup F_p$, and $\mathit{str}(S')$
be the set of interpretations for $S'$.

Reductions appropriate for learning with membership queries were defined in
[AK95] where they are called pwm-reductions. Three transformations are
required. The representation transformation maps $T \in \mathcal{H}_r(S)$ to $T' \in \mathcal{H}_r(S')$, the
example transformation maps $I \in \mathit{str}(S)$ to $I' \in \mathit{str}(S')$, and the membership
queries transformation maps $I' \in \mathit{str}(S')$ to $\{\mathit{Yes}, \mathit{No}\} \cup \mathit{str}(S)$. Intuitively, the
learner for $T \in \mathcal{H}_r(S)$ will be constructed out of a learner for $T' \in \mathcal{H}_r(S')$ (the
image of the representation transformation) by using the transformations. The
obvious properties required of these transformations guarantee correctness. The
example and representation transformations guarantee that the learner receives
correct examples for $T'$ and the membership query transformation guarantees
that queries can be either answered immediately or transferred to the membership
oracle for $T$.

The Representation Transformation: Let $T \in \mathcal{H}(S,=)$ be a Horn expression;
then the expression $\mathit{flat}(T) \in \mathcal{H}_R(S')$ is formed by replacing each equational
literal $(z = f(x_1, \ldots, x_n))$ with the corresponding atom $f_p(x_1, \ldots, x_n, z)$. Thus,
the equational clause given above is transformed to:

\[ f_{3p}(x_1,z_1) \wedge f_{1p}(x_2,z_1) \wedge f_{2p}(z_2) \wedge p_1(x_1,z_1) \wedge p_2(x_2) \rightarrow p_1(x_1,z_2). \]
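
The replacement rule can be sketched in Python as follows. This is only an illustration under assumed conventions, not code from the paper: atoms are `(predicate, args)` pairs, an equational literal $z = f(x_1,\ldots,x_n)$ is written as the hypothetical triple `("=", z, (f, args))`, and the flattened predicate for `f` is named `f + "_p"`. The input clause is the equational form read back from the flattened clause shown in the text.

```python
def flatten_literal(literal):
    """Turn ('=', z, (f, args)) into the atom (f + '_p', args + (z,)); keep other atoms."""
    if literal[0] == "=":
        _, z, (f, args) = literal
        return (f + "_p", tuple(args) + (z,))
    return literal


def flatten_clause(antecedent, consequent):
    """Flatten every literal of a clause given as (list of literals, consequent atom)."""
    return [flatten_literal(lit) for lit in antecedent], flatten_literal(consequent)


# Running example; variable names are plain strings (an assumption of this sketch).
antecedent = [
    ("=", "z1", ("f3", ("x1",))),
    ("=", "z1", ("f1", ("x2",))),
    ("=", "z2", ("f2", ())),      # f2 has arity 0, i.e. it is a constant
    ("p1", ("x1", "z1")),
    ("p2", ("x2",)),
]
consequent = ("p1", ("x1", "z2"))
flat_antecedent, flat_consequent = flatten_clause(antecedent, consequent)
```

Because each equational literal becomes exactly one atom, the flattened clause has the same variables as the equational form.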

The definitions of root variables and legal ancestors hold for the flat versions as
well. We also axiomatise the fact that the new predicates are functional. Our
treatment diverges from previous uses of flattening in that the function
symbols are taken out of the language. For every $f \in F$ of arity $n$ let

\[ \mathit{exist}_f = (\forall x_1, \forall x_2, \ldots, \forall x_n, \exists z,\ f_p(x_1, \ldots, x_n, z)) \]
\[ \mathit{unique}_f = (\forall x_1, \forall x_2, \ldots, \forall x_n, \forall z_1, \forall z_2,\ f_p(x_1, \ldots, x_n, z_1) \wedge f_p(x_1, \ldots, x_n, z_2) \rightarrow (z_1 = z_2)) \]

Let $\phi_f = \mathit{exist}_f \wedge \mathit{unique}_f$, $\phi_F = \bigwedge_{f \in F} \phi_f$, and $\phi_{unique} = \bigwedge_{f \in F} \mathit{unique}_f$. We
call $\mathit{exist}_f$ the existence clause of $f$ and $\mathit{unique}_f$ the uniqueness clause.

The Example Transformation: Let $I$ be an interpretation for $S$; then $\mathit{flat}(I)$
is an interpretation for $S'$ defined as follows. The domain of $\mathit{flat}(I)$ is equal
to the domain of $I$ and the extension of predicates in $P$ is the same as in $I$.
The extension of a predicate $f_p \in F_p$ of arity $n + 1$ is defined in a natural
way to include all tuples $(a_1, \ldots, a_n, a_{n+1})$ where the $a_i$ are domain elements and
$f(a_1, \ldots, a_n)$ corresponds to $a_{n+1}$ in $I$.
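
As a minimal sketch of the example transformation (with an assumed encoding that is not part of the paper), let an interpretation be a domain, a dictionary of predicate extensions, and a dictionary of function tables. The helper names `flat_interpretation` and `unflat_interpretation` are hypothetical.

```python
def flat_interpretation(domain, preds, funcs):
    """Return (domain, predicate extensions) of flat(I).

    preds maps a predicate name to a set of argument tuples;
    funcs maps a function name to a dict {argument tuple: value}.
    """
    flat_preds = {name: set(ext) for name, ext in preds.items()}
    for f, table in funcs.items():
        # f(a_1, ..., a_n) = a_{n+1} becomes the tuple (a_1, ..., a_n, a_{n+1}) of f_p.
        flat_preds[f + "_p"] = {args + (value,) for args, value in table.items()}
    return set(domain), flat_preds


def unflat_interpretation(domain, flat_preds, func_names):
    """Reverse the mapping, assuming every f_p encodes a total function on the domain."""
    functional = {f + "_p" for f in func_names}
    preds = {n: set(e) for n, e in flat_preds.items() if n not in functional}
    funcs = {f: {t[:-1]: t[-1] for t in flat_preds[f + "_p"]} for f in func_names}
    return set(domain), preds, funcs


domain = {1, 2}
preds = {"p": {(1,)}}
funcs = {"f": {(1,): 2, (2,): 2}, "c": {(): 1}}   # c is a constant (arity 0)
flat_domain, flat_preds = flat_interpretation(domain, preds, funcs)
```

The domain and the extensions of the original predicates are copied unchanged; only the function tables are re-expressed as relations.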

Lemma 2. For all $T \in \mathcal{H}(S,=)$ and for all $I \in \mathrm{str}(S)$:

(1) $\mathit{flat}(I) \models \phi_F$.

(2) $I \models T$ if and only if $\mathit{flat}(I) \models \mathit{flat}(T)$.

Proof. Since each constant and each term are mapped to precisely one domain
element in $I$, part (1) is true by the construction of $\mathit{flat}(I)$. For (2) note that
$\mathit{flat}(T)$ and the equational form of $T$ have the same variables, and $I$ and $\mathit{flat}(I)$
have the same domain. Let $\theta$ be a mapping of these variables to the domain. By
construction, $(z = f(x_1, \ldots, x_n))\theta$ is true in $I$ if and only if $f_p(x_1, \ldots, x_n, z)\theta$
is true in $\mathit{flat}(I)$. Moreover, predicates in $P$ have the same extension in $I$ and
$\mathit{flat}(I)$. Therefore, a falsifying substitution for one can be used as a falsifying
substitution for the other. ■

The Membership Queries Transformation: A mapping converting structures
from $\mathrm{str}(S')$ to $\mathrm{str}(S)$ is a bit more involved. Let $J \in \mathrm{str}(S')$; if $J \models \phi_F$ then
the previous mapping can simply be reversed, and we denote it by $\mathit{unflat}(J)$.
Otherwise there are two cases. If $J$ falsifies the uniqueness clause, it is in some
sense inconsistent with the intended usage of the functional predicates. Such
interpretations are not output by the algorithm of [Kha98] when learning $\mathcal{H}_R(S')$
and hence we do not need to deal with them. If $J$ satisfies the uniqueness clause
(of all function symbols) but falsifies the existence axiom then some information
on the interpretation of the function symbols is missing. In this case we complete
it by introducing a new domain element $*$ and defining $\mathit{complete}(J) \in \mathrm{str}(S')$
to be the interpretation in which every ground instance of the existence clauses
which is false in $J$ is made true by adding a positive atom whose last term is $*$.
For example, if there is no $b$ such that $f_p(1,b)$ is true in $J$ then we add $f_p(1,*)$
to $\mathit{complete}(J)$. For any $J \in \mathrm{str}(S')$ such that $J \models \phi_{unique}$, the interpretation
$J$ is transformed in this way into $\mathit{unflat}(\mathit{complete}(J))$. Note that the object $*$ is
non-legal (cf. the definition of legal objects) in $\mathit{unflat}(\mathit{complete}(J))$.
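
The completion step can be sketched as below. The encoding (a dictionary from predicate names to sets of tuples) and the name `complete` as a Python function are assumptions made for illustration. The sketch also assumes that argument tuples mentioning the new element are themselves sent to the new element, so that the completed relation is a total function.

```python
from itertools import product

STAR = "*"  # the fresh, non-legal domain element


def complete(domain, flat_preds, func_arities):
    """Add f_p(a_1, ..., a_n, *) wherever J has no value for f on (a_1, ..., a_n).

    func_arities maps each functional predicate name (e.g. 'f_p') to the arity n
    of the original function symbol. J is assumed to satisfy the uniqueness clauses.
    """
    new_domain = set(domain) | {STAR}
    new_preds = {name: set(ext) for name, ext in flat_preds.items()}
    for f_p, n in func_arities.items():
        ext = new_preds.setdefault(f_p, set())
        defined = {t[:-1] for t in ext}
        # Assumption of this sketch: tuples that mention * are also mapped to *.
        for args in product(new_domain, repeat=n):
            if args not in defined:
                ext.add(args + (STAR,))
    return new_domain, new_preds


# The example from the text: there is no b such that f_p(1, b) is true in J.
J_domain = {1, 2}
J_preds = {"f_p": {(2, 1)}, "p": {(1,)}}
new_domain, new_preds = complete(J_domain, J_preds, {"f_p": 1})
```

The input interpretation is copied rather than modified, so the original `J_preds` can still be used for comparison.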

Lemma 3. For all $T \in \mathcal{H}_R(S,=)$ and for all $J \in \mathrm{str}(S')$ such that $J \models \phi_{unique}$
the following conditions are equivalent.

(1) $J \models \mathit{flat}(T)$.

(2) $\mathit{complete}(J) \models \mathit{flat}(T)$.

(3) $\mathit{unflat}(\mathit{complete}(J)) \models T$.

Proof. Since $\mathit{unflat}(\mathit{complete}(J))$ is in $\mathrm{str}(S)$ and $\mathit{flat}(\mathit{unflat}(\mathit{complete}(J))) =
\mathit{complete}(J)$, Lemma 2 implies that (2) and (3) are equivalent. Now, if
$J \not\models \mathit{flat}(T)\theta$ for some $\theta$ then, since the $\mathit{complete}()$ construction does not change the
truth value for atoms whose objects are in $J$, we also have $\mathit{complete}(J) \not\models \mathit{flat}(T)\theta$.
Thus (2) implies (1).

Finally, if $\mathit{unflat}(\mathit{complete}(J)) \not\models T\theta$ then by Lemma 1 $\theta$ does not use non-legal
objects, and in particular the object $*$. Hence we can use $\theta$ in $J$ and the argument
in Lemma 2 shows that $J \not\models \mathit{flat}(T)\theta$. Therefore (1) implies (3). ■

For $S = (P,F)$ let $|S| = |P| + |F|$ be the number of symbols in the signature.

Theorem 1. The class $\mathcal{H}_R(S,=)$ is learnable from equivalence and membership
queries with hypothesis in $\mathcal{H}_R(S')$.

For $T \in \mathcal{H}_R(S,=)$ with $m$ clauses and at most $t$ variables, the algorithm makes
equivalence queries, 0{{nm + m?T)\S\F) membership queries, 
and its running time is polynomial in n*" + t^ + m + \S\ + P , where n is the 
number of objects in the largest counter example presented to the algorithm, and 
r is the maximal arity of predicates and function symbols in S. 

Proof. The theorem follows from properties of pwm-reductions [AK95] and the
result in [Kha98] showing that $\mathcal{H}_R(S')$ is learnable. The idea is that when
learning $T \in \mathcal{H}_R(S,=)$ we will run the algorithm A2 from [Kha98] to learn the
expression $\mathit{flat}(T) \in \mathcal{H}_R(S')$. When A2 presents $H \in \mathcal{H}_R(S')$ to an equivalence
query we interpret this by saying that $I \in \mathrm{str}(S)$ is a model of $H$ if and only
if $\mathit{flat}(I) \models H$. Hence given a counter example $I$ we simply compute $\mathit{flat}(I)$
and present it as a counter example to A2. Lemma 2 and the above interpretation
guarantee that the examples it receives are correct. When A2 presents
$J$ for a membership query, we compute $\mathit{unflat}(\mathit{complete}(J))$, present it to MQ
and return its answer to A2. Lemma 3 guarantees that the answer is correct.
By Corollary 11 of [Kha98] we get that A2 will find an expression equivalent to
$\mathit{flat}(T)$. The complexity bound follows from [Kha98]. ■

4.1 Modifying the Hypothesis Language

The previous theorem produces a hypothesis in $\mathcal{H}_R(S')$ while the target expression
is in $\mathcal{H}_R(S,=)$. We next show how to use a hypothesis in $\mathcal{H}_R(S,=)$.

We first need to describe the hypothesis of the learning algorithm A2 from
[Kha98]. The algorithm maintains a set of interpretations $S \subseteq \mathrm{str}(S')$ such that
for each $J \in S$, $J \not\models \mathit{flat}(T)$. The hypothesis is $H = \bigwedge_{J \in S} \textit{rel-cands}(J)$ where
$\textit{rel-cands}(J)$ is a set of clauses produced as follows. First take the conjunction of
positive atoms true in $J$ as an antecedent and an atom false in $J$ as a consequent.
Each such choice of consequent generates a ground clause. Considering each
ground clause separately, substitute a unique variable for each object in the clause
to get a clause in $\textit{rel-cands}(J)$.
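
The construction of the candidate clauses can be illustrated with the following sketch. It is not the implementation of A2 from [Kha98]; the function name `rel_cands`, the dictionary encoding, and the restriction to consequents whose objects all occur in the antecedent (so that every clause is range restricted) are assumptions of this example.

```python
from itertools import product


def rel_cands(flat_preds, arities):
    """Enumerate variabilised clauses built from an interpretation J.

    flat_preds maps each predicate to the set of tuples true in J;
    arities maps every predicate of the signature to its arity.
    """
    true_atoms = sorted(((p, t) for p, ext in flat_preds.items() for t in ext), key=repr)
    objects = sorted({a for _, t in true_atoms for a in t}, key=repr)
    var = {obj: "x%d" % i for i, obj in enumerate(objects)}  # one variable per object
    antecedent = [(p, tuple(var[a] for a in t)) for p, t in true_atoms]
    clauses = []
    for p, n in sorted(arities.items()):
        for t in product(objects, repeat=n):
            if t not in flat_preds.get(p, set()):  # the atom p(t) is false in J
                clauses.append((antecedent, (p, tuple(var[a] for a in t))))
    return clauses


J_preds = {"p": {(1, 2)}, "q": set()}
clauses = rel_cands(J_preds, {"p": 2, "q": 1})
```

In this small example the antecedent is the single atom `p(x0, x1)` and there is one clause for each of the five atoms over the objects 1 and 2 that are false in `J`.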

We generate clauses over $S$ by reversing the $\mathit{flat}()$ operation; namely, replacing
every literal $f_p(x_1, \ldots, x_n, x_{n+1})$ (where $f_p \in F_p$) by the corresponding literal
$(x_{n+1} = f(x_1, \ldots, x_n))$. For $C \in \textit{rel-cands}(J)$ let $\mathit{unflat}(C)$ be the resulting
clause. Notice that $\mathit{unflat}(C)$ is in $\mathcal{H}(S,=)$ but it may not be in $\mathcal{H}_R(S,=)$ since
some of the variables in its equality literals may not be legal ancestors (cf. the
definition of legal ancestors). Since the clauses in question are in $\mathcal{H}(S,=)$, and since
$\mathit{flat}(\mathit{unflat}(C)) = C$, the following is a special case of Lemma 2.

Lemma 4. For all $I \in \mathrm{str}(S)$, for all $J \in \mathrm{str}(S')$, and for all $C \in \textit{rel-cands}(J)$,
$I \models \mathit{unflat}(C)$ if and only if $\mathit{flat}(I) \models C$.

When applied in this way we can see that a hypothesis modified by $\mathit{unflat}()$
attracts precisely the same counter examples and we get learnability with
expressions over the signature $S$. A further improvement is needed to generate a
hypothesis in $\mathcal{H}_R(S,=)$. Define legal objects of an interpretation over $S'$ in
accordance with the definition for $S$ (so that the same thing results when flattening).
Let $J \in \mathrm{str}(S')$ be an interpretation with domain $D$. For $D' \subseteq D$ let $J|_{D'}$ be the
projection of $J$ over $D'$. Namely, the interpretation where the domain is $D'$ and
an atom $q(a_1, \ldots, a_n)$, where $a_1, \ldots, a_n \in D'$, is true in $J|_{D'}$ if and only if it is
true in $J$.
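
A projection is easy to state in code. The following sketch, with hypothetical helper names `project` and `drop_object` and the same dictionary encoding of extensions as an assumption, keeps exactly the atoms whose arguments all lie in the smaller domain.

```python
def project(preds, sub_domain):
    """Keep exactly the atoms of J whose arguments all lie in sub_domain (D')."""
    keep = set(sub_domain)
    return {
        name: {t for t in ext if all(a in keep for a in t)}
        for name, ext in preds.items()
    }


def drop_object(domain, preds, obj):
    """Projection of J over D minus {obj}, the form used when one object is dropped."""
    smaller = set(domain) - {obj}
    return smaller, project(preds, smaller)


J_domain = {1, 2, 3}
J_preds = {"p": {(1, 2), (2, 3)}, "f_p": {(1, 1), (3, 2)}}
small_domain, small_preds = drop_object(J_domain, J_preds, 3)
```

Dropping an object can leave a functional predicate undefined on some arguments; in the setting of the paper such gaps are filled again by the completion step before a membership query is asked.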

Lemma 5. Let $T \in \mathcal{H}_R(S,=)$ and $J \in \mathrm{str}(S')$, such that $J \models \phi_{unique}$. Let $D$
be the domain of $J$ and let $a \in D$ be a non-legal object in $J$. Then
$\mathit{unflat}(\mathit{complete}(J)) \models T$ if and only if $\mathit{unflat}(\mathit{complete}(J|_{D \setminus a})) \models T$.

Proof. Since $a$ is non-legal in $J$ if and only if it is non-legal in $\mathit{unflat}(\mathit{complete}(J))$,
Lemma 1 implies that if $\mathit{unflat}(\mathit{complete}(J)) \not\models T\theta$ then $\theta$ does not use $a$.
Similarly, $\theta$ does not use the object $*$. Since the extension of predicates and
mapping of functions over the other objects is not changed it follows that
$\mathit{unflat}(\mathit{complete}(J|_{D \setminus a})) \not\models T\theta$.

For the other direction, if $\mathit{unflat}(\mathit{complete}(J|_{D \setminus a})) \not\models T\theta$ then by the same
argument $\theta$ does not use the object $*$. Again, since the extension of predicates
and mapping of functions over the other objects is not changed, $\theta$ can be used
in $\mathit{unflat}(\mathit{complete}(J))$ without changing the truth value of $T\theta$. ■

Since a membership query of the algorithm (i.e. whether $J \models \mathit{flat}(T)$) is translated
to a membership query for $T$ (i.e. whether $\mathit{unflat}(\mathit{complete}(J)) \models T$), the
lemma indicates that all non-legal objects can be dropped from $J$ before making
the membership query. This fact is utilised in the next section.

For our current purpose it suffices to observe that in A2 dropping of objects
happens by default. In particular, whenever the algorithm A2 (with its optional
step taken) puts an interpretation $J$ into the set $S$ (that generates its hypothesis
as discussed above), it makes sure that $J \not\models \mathit{flat}(T)$ and that for every object $a$ in
the domain $D$ of $J$, it holds that $J|_{D \setminus a} \models \mathit{flat}(T)$. If this does not hold then it
uses $J|_{D \setminus a}$ instead of $J$. Therefore, by Lemma 5 we get that all objects in all
interpretations in $S$ are legal objects. This in turn implies that the hypothesis
is in $\mathcal{H}_R(S,=)$.

Theorem 2. The class $\mathcal{H}_R(S,=)$ is learnable from equivalence and membership
queries.

For a clause $C \in \mathcal{H}_R(S)$, by the number of distinct terms in $C$ we mean the
number of distinct elements in the set of all terms in $C$ and all their subterms.
For example, $(p(x, f_1(x), f_2(f_1(x)), f_3()) \rightarrow q(f_1(x)))$ has 4 distinct terms:
$x$, $f_3()$, $f_1(x)$, $f_2(f_1(x))$.
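
The count can be checked mechanically. In the sketch below, which uses an assumed encoding, a variable is a string and a function application is a pair `(name, args)`; the function names are hypothetical.

```python
def collect_subterms(term, acc):
    """Add term and all of its subterms to the set acc."""
    acc.add(term)
    if isinstance(term, tuple):          # a function application (name, args)
        for arg in term[1]:
            collect_subterms(arg, acc)
    return acc


def distinct_terms(atoms):
    """Number of distinct terms, including subterms, over a list of (pred, args) atoms."""
    acc = set()
    for _, args in atoms:
        for term in args:
            collect_subterms(term, acc)
    return len(acc)


x = "x"
f1x = ("f1", (x,))
f2f1x = ("f2", (f1x,))
f3 = ("f3", ())
clause_atoms = [("p", (x, f1x, f2f1x, f3)), ("q", (f1x,))]
n_terms = distinct_terms(clause_atoms)
```

For the clause in the example above, `n_terms` is 4: the shared subterm `f1(x)` is counted once even though it occurs three times.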

Corollary 1. The class $\mathcal{H}_R(S)$ is learnable from equivalence and membership
queries with hypothesis in $\mathcal{H}_R(S,=)$. The complexity is as in the previous
theorem, where $t$ is the maximal number of distinct terms in a clause in the target
expression.

Proof. Learnability follows since, by the earlier proposition, every $T \in \mathcal{H}_R(S)$ has an
equivalent expression in $\mathcal{H}_R(S,=)$. It remains to observe that each distinct term
in a clause $C \in \mathcal{H}_R(S)$ is mapped to a variable in the equational form. ■

5 Learning from Entailment 

In this model examples are clauses in the underlying language $\mathcal{H}$ [FP93]. An
example $C \in \mathcal{H}$ is positive for $T \in \mathcal{H}$ if $T \models C$. The equivalence and membership
oracles are defined accordingly. For membership queries, the learner presents a
clause $C$ and the oracle EntMQ returns "yes" iff $T \models C$. For equivalence queries,
the learner presents a hypothesis $H \in \mathcal{H}$ and the oracle EntEQ returns "yes"
if for all $I$, $I \models T$ iff $I \models H$; otherwise it returns a counter example $C$ such
that $T \models C$ and $H \not\models C$ (a positive counter example) or $T \not\models C$ and $H \models C$ (a
negative counter example).

Since one can identify non-legal objects and (by Lemma 5) drop them before
making a membership query, the following lemma indicates that we can replace
MQ by EntMQ for clauses in $\mathcal{H}_R(S,=)$.

Lemma 6. Let $T \in \mathcal{H}_R(S,=)$ and let $J \in \mathrm{str}(S')$ be such that $J \models \phi_{unique}$
and all objects in $J$ are legal objects. Then $\mathit{unflat}(\mathit{complete}(J)) \not\models T$ if and only
if $T \models \mathit{unflat}(C)$ for some $C \in \textit{rel-cands}(J)$.

Proof. Let $I = \mathit{unflat}(\mathit{complete}(J))$. First note that by construction
$I \not\models \mathit{unflat}(C)$ for all $C \in \textit{rel-cands}(J)$. Hence if $T \models \mathit{unflat}(C)$ for some such $C$
then $I \not\models T$.

For the other direction, let $\gamma$ be the (reverse) substitution that is used when
generating $\textit{rel-cands}(J)$. Let $R$ be a clause in $T$ and $\theta$ a substitution such that
$I \not\models R\theta$. The antecedent of $R$ is satisfied by $\theta$ in $I$ and, by Lemma 1, $\theta$ does not
use the object $*$. Therefore $\mathit{ant}(R)\theta\gamma \subseteq \mathit{ant}(\mathit{unflat}(C))$ for all $C \in \textit{rel-cands}(J)$,
where $\mathit{ant}()$ refers to the antecedent part of the clause considered as a set of
literals. (The resulting substitution $\theta\gamma$ is a variable renaming that may unify
several variables into one.) Since in $\textit{rel-cands}(J)$ all range restricted consequents
are considered, we get that for some $C \in \textit{rel-cands}(J)$, $R\theta\gamma \subseteq \mathit{unflat}(C)$, where
here we consider a clause as a set of literals. (In other words, $R$ $\theta$-subsumes
$\mathit{unflat}(C)$ [Plo70].) We therefore get that $T \models R \models R\theta\gamma \models \mathit{unflat}(C)$. ■

The following lemma provides a model finding algorithm for $\mathcal{H}_R(S,=)$.

Lemma 7. Given $H \in \mathcal{H}_R(S,=)$ and a clause $C \in \mathcal{H}_R(S,=)$ such that $H \not\models C$,
one can find an interpretation $I \in \mathrm{str}(S)$ such that $I \models H$ and $I \not\models C$ in time
$O(|H| \cdot |S| \cdot n^{t+r})$, where $|H|$ is the number of clauses in $H$, $n$ is the number of
terms in $C$, $t$ is the maximal number of variables in a clause of $H$, and $r$ is the
maximal arity.

Proof. The idea is to generate an interpretation from $C$ and then make sure (by
forward chaining) that it is a model of $H$ but not of $C$.

Generate a structure $I_0 \in \mathrm{str}(S)$ as follows. First, introduce a unique domain
element for each term in $C$ and then join elements if their terms are equated
in the antecedent of $C$; let this domain be $D$. The extension of predicates in
$I_0$ includes precisely the atoms corresponding to the antecedent of $C$ and the
mapping of domain elements produced. Let $p$ be the (ground atom which is the)
consequent of $C$ under this mapping. The mapping of function symbols includes
the initial mapping used when constructing $D$. It is then extended (as in the
$\mathit{complete}()$ construction) by adding another domain element $*$ and mapping each
term $f(a_1, \ldots, a_n)$ that has not yet been assigned to $*$. Note that $*$ is a non-legal
object.

Next, let $I = I_0$ and run forward chaining on $H$, adding positive atoms to $I$.
That is, repeat the following procedure: find a clause $R$ in $H$ and a substitution
$\theta$ such that $I \not\models R\theta$ and add the atom corresponding to the consequent of $R\theta$ to
$I$. This results in an interpretation $I$ whose domain size is at most the number
of distinct terms in $C$ plus 1, and which is a model of $H$. This is true since $H$
is definite and the domain of $I_0$ is finite, and hence by adding atoms to $I_0$ we
eventually get to a state where all clauses are satisfied (there is a finite number
of atoms that can be added). We claim that $p$ is not in $I$ and hence $I \not\models C$.

(Footnote: Note that there is no need to perform equational reasoning here and a syntactic
matching suffices. This is true since in $\mathcal{H}_R(S,=)$ all terms are of depth 0 or 1.)

Since $H \in \mathcal{H}_R(S,=)$, by Lemma 1 if $I \not\models H\theta$ then $\theta$ does not use the object
$*$. Hence forward chaining does not produce any positive atoms containing the
object $*$. Inductively, this shows that no such atom is true in $I$.

Let $J$ be some interpretation such that $J \models H$ and $J \not\models C$ (which exists by the
condition of the lemma). Let $\theta$ be such that $J \not\models C\theta$ and let $q$ be the consequent
of $C\theta$. Clearly, $q$ is not true in $J$. Moreover, there is a mapping from objects in
$I_0$ (apart from $*$) to the objects in $J$ that are used in $C\theta$, so that all positive
atoms true in $I_0$ are true in $J$ under this mapping, and all equalities true in $I_0$
(apart from ones referring to $*$) are true in $J$ under this mapping. Namely, a
homomorphic embedding [CK90] of $\mathit{flat}(I_0)|_D$ into $\mathit{flat}(J)$.

Finally, assume that $p$ is in $I$. Since its forward chaining does not use the object
$*$, we can use the same chaining under the homomorphism to generate $q$ in $J$,
and therefore, since $J \models H$, $q$ is in $J$, a contradiction.

The complexity bound follows since in each iteration we can check whether
forward chaining adds a new atom in time $|H| n^t$ and there are at most $|S| n^r$
iterations. ■
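
The forward chaining step of the proof can be sketched for flattened (function-free) definite clauses. This is an illustrative simplification, not the procedure of the paper: the special object $*$ and the joining of equated terms are omitted, clauses are `(antecedent, consequent)` pairs of `(predicate, variable tuple)` atoms, and the names `match`, `forward_chain` and `derives` are hypothetical.

```python
def match(atoms, facts, theta):
    """Yield every extension of theta that maps all atoms to ground facts."""
    if not atoms:
        yield theta
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for fact_pred, fact_args in facts:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        new = dict(theta)
        # setdefault binds a fresh variable or returns the existing binding.
        if all(new.setdefault(v, o) == o for v, o in zip(args, fact_args)):
            yield from match(rest, facts, new)


def forward_chain(clauses, facts):
    """Add consequents of falsified ground instances until every clause is satisfied."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, (pred, args) in clauses:
            # list() finishes matching before the fact set is modified.
            for theta in list(match(antecedent, facts, {})):
                atom = (pred, tuple(theta[v] for v in args))
                if atom not in facts:
                    facts.add(atom)
                    changed = True
    return facts


def derives(clauses, antecedent_facts, goal):
    """True if the goal atom is produced by chaining from the given ground atoms."""
    return goal in forward_chain(clauses, antecedent_facts)


# H: p(x, y) and p(y, z) imply p(x, z); start from the ground antecedent of a clause C.
H = [([("p", ("x", "y")), ("p", ("y", "z"))], ("p", ("x", "z")))]
start = {("p", (1, 2)), ("p", (2, 3))}
model = forward_chain(H, start)
```

Matching is purely syntactic, in line with the footnote in the proof, and termination follows because only finitely many atoms over a finite domain can be added.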

The above process is similar to the use of the chase procedure to decide on
uniform containment of database queries [Sag88]. Since we have access to EntMQ
we can make sure that all clauses in the hypothesis of the algorithm are implied
by the target function. (This essentially replaces the positive counter examples
in the interpretations setting with EntMQ in the entailment setting.) Thus, the
following lemma indicates that in the presence of EntMQ we can replace EQ by
EntEQ.

Lemma 8. Let $T \in \mathcal{H}_R(S,=)$, $H \in \mathcal{H}_R(S,=)$ and $T \models H$. Given a positive
(clause) counter example $C \in \mathcal{H}_R(S,=)$ such that $T \models C$ and $H \not\models C$ one can
find a negative (interpretation) counter example $I$ such that $I \not\models T$ and $I \models H$.

Proof. This easily follows from the previous lemma since $I \not\models C$ and $T \models C$
implies $I \not\models T$. ■

We therefore get that the class is learnable. The complexity of the algorithm 
is as in the interpretation setting (though a slightly more careful argument is 
needed) . 

Theorem 3. The class $\mathcal{H}_R(S,=)$ is learnable from entailment equivalence
queries and entailment membership queries.

As before we can get a learnability result for $\mathcal{H}_R(S)$; this time, however, we can
use a hypothesis in $\mathcal{H}_R(S)$. Note that when learning $T \in \mathcal{H}_R(S)$, interpretation
counter examples constructed by the model finding algorithm have a special
structure. In particular, since $C \in \mathcal{H}_R(S)$ (in Lemma 8) every object in $I$ (of
Lemma 7) has a unique term associated with it (as generated from $C$). It follows
that in the clauses generated in $\textit{rel-cands}(I)$ each variable has at most one defining
equation. Therefore, the clauses can be "folded back" from the equational form
into a range restricted form.

Theorem 4. The class $\mathcal{H}_R(S)$ is learnable from entailment equivalence queries
and entailment membership queries.



6 The Implication Problem 

Once expressions are learned as above one would want to use them as a knowledge 
base in a system, for example to infer properties implied by this knowledge. It 
is therefore useful if the implication problem is decidable and its complexity is 
bounded. It is easy to see that the model finding algorithm from Lemma 7 can
be used to decide the implication problem.

Corollary 2. The implication problem for $\mathcal{H}_R(S,=)$ is decidable in time
$O(|H| \cdot |S| \cdot n^{t+r})$.

We note that similar problems have been studied in database theory; checking
whether $I \models H$ corresponds to query evaluation, and checking whether $H \models C$
corresponds to uniform containment [Sag88]. Completeness results for these
problems parametrised by the number of variables in a clause follow from [PY97].

7 Discussion 

We have shown that weakly range restricted Horn expressions are learnable from 
equivalence and membership queries, both for learning from interpretations and 
for learning from entailment. In the special case where the target expression is 
range restricted, we can use range restricted expressions as the hypothesis lan- 
guage for learning from entailment. For learning from interpretations hypotheses 
are weakly range restricted. Our results use flattening and unflattening of exam- 
ples and clauses and a model finding procedure for this class. 

The learning algorithm is similar to the algorithm for learning from entailment in
the propositional case as well as several previous ILP algorithms. In fact,
the construction in Lemma 7 corresponds to elaboration in [SB86] and saturation
in [Rou92], and flattening has been used before. The pairing procedure
from [Kha98] is similar to LGG computation [Plo70] used in many systems. In
addition the dropping of non-legal literals is similar to what is done in [Rou92].
As we have shown, a combination of these steps is formally justified in that it
leads to convergence for range restricted Horn expressions.

Previous work in !ArU17IHThHIH,SlT^ pursued similar problems in the context of 
learning from entailment. These works use oracles that are stronger than the 
ones used here in that they provide information on the syntax of the learned 
expression (using the order on atoms for acyclic expressions or otherwise infor- 
mation on subsumption rather than implication). On the other hand they derive 
complexity bounds that are lower than the ones here, in particular avoiding the 
exponential dependence on the number of terms in a clause. This is partly due
to the use of strong oracles and partly due to the fact that different subclasses
of Horn expressions are studied.

A natural question from the discussion above is whether the exponential
dependence on the number of terms can be avoided without using the additional
oracles. On the other hand, relaxing the requirement that clauses are range
restricted is also of interest since many standard logic programs use recursive
patterns that do not conform to it. Finally, in the model inference problem
the learner is trying to acquire information about a model rather than a
formula. In contrast with the scenario here, the domain and mapping of function
symbols are fixed and hence the nature of the problem is different. More work
is needed to clarify these issues.

References 



AFP92. D. Angluin, M. Frazier, and L. Pitt. Learning conjunctions of Horn clauses.
Machine Learning, 9:147-164, 1992.

AK95. D. Angluin and M. Kharitonov. When won't membership queries help?
Journal of Computer and System Sciences, 50:336-355, 1995.

Ang88. D. Angluin. Queries and concept learning. Machine Learning, 2(4):319-342,
1988.

Ari97. H. Arimura. Learning acyclic first-order Horn sentences from entailment.
In Proceedings of the International Conference on Algorithmic Learning
Theory. Springer-Verlag, 1997. LNAI 1316.

CK90. C. Chang and J. Keisler. Model Theory. Elsevier, Amsterdam, Holland,
1990.

Coh95a. W. Cohen. PAC-learning recursive logic programs: Efficient algorithms.
Journal of Artificial Intelligence Research, 2:501-539, 1995.

Coh95b. W. Cohen. PAC-learning recursive logic programs: Negative results. Journal
of Artificial Intelligence Research, 2:541-573, 1995.

DR97. L. De Raedt. Logical settings for concept learning. Artificial Intelligence,
95(1):187-201, 1997. See also relevant Errata.

DRB92. L. De Raedt and M. Bruynooghe. An overview of the interactive concept
learner and theory revisor CLINT. In S. Muggleton, editor, Inductive Logic
Programming. Academic Press, 1992.

DRD94. L. De Raedt and S. Dzeroski. First order jk-clausal theories are PAC-learnable.
Artificial Intelligence, 70:375-392, 1994.

FP93. M. Frazier and L. Pitt. Learning from entailment: An application to
propositional Horn sentences. In Proceedings of the International Conference on
Machine Learning, pages 120-127, Amherst, MA, 1993. Morgan Kaufmann.

Hau89. D. Haussler. Learning conjunctive concepts in structural domains. Machine
Learning, 4(1):7-40, 1989.

Kha96. R. Khardon. Learning to take actions. In Proceedings of the National
Conference on Artificial Intelligence, pages 787-792, Portland, Oregon, 1996.
AAAI Press.

Kha98. R. Khardon. Learning function free Horn expressions. Technical Report
ECS-LFCS-98-394, Laboratory for Foundations of Computer Science,
Edinburgh University, 1998. A preliminary version of this paper appeared in
COLT 1998.

Lit88. N. Littlestone. Learning quickly when irrelevant attributes abound: A new
linear-threshold algorithm. Machine Learning, 2:285-318, 1988.

Llo87. J.W. Lloyd. Foundations of Logic Programming. Springer Verlag, 1987.
Second Edition.

MB92. S. Muggleton and W. Buntine. Machine invention of first order predicates
by inverting resolution. In S. Muggleton, editor, Inductive Logic Programming.
Academic Press, 1992.

MDR94. S. Muggleton and L. De Raedt. Inductive logic programming: Theory and
methods. Journal of Logic Programming, 20:629-679, 1994.

MF92. S. Muggleton and C. Feng. Efficient induction of logic programs. In
S. Muggleton, editor, Inductive Logic Programming. Academic Press, 1992.

Min88. J. Minker, editor. Foundations of Deductive Databases and Logic
Programming. Morgan Kaufmann, 1988.

NCDW97. S. Nienhuys-Cheng and R. De Wolf. Foundations of Inductive Logic
Programming. Springer-Verlag, 1997. LNAI 1228.

Plo70. G. D. Plotkin. A note on inductive generalization. In B. Meltzer and
D. Michie, editors, Machine Intelligence 5, pages 153-163. American Elsevier,
1970.

PY97. C. H. Papadimitriou and M. Yannakakis. On the complexity of database
queries. In Proceedings of the Symposium on Principles of Database Systems,
pages 12-19, Tucson, Arizona, 1997. ACM Press.

Rou92. C. Rouveirol. Extensions of inversion of resolution applied to theory
completion. In S. Muggleton, editor, Inductive Logic Programming. Academic
Press, 1992.

RS98. K. Rao and A. Sattar. Learning from entailment of logic programs with
local variables. In Proceedings of the International Conference on Algorithmic
Learning Theory. Springer-Verlag, 1998. LNAI 1501.

RT97. C. Reddy and P. Tadepalli. Learning Horn definitions with equivalence and
membership queries. In International Workshop on Inductive Logic Programming,
pages 243-255, Prague, Czech Republic, 1997. Springer. LNAI 1297.

RT98. C. Reddy and P. Tadepalli. Learning first order acyclic Horn programs from
entailment. In International Conference on Inductive Logic Programming,
pages 23-37, Madison, WI, 1998. Springer. LNAI 1446.

RTR96. C. Reddy, P. Tadepalli, and S. Roncagliolo. Theory guided empirical
speedup learning of goal decomposition rules. In International Conference
on Machine Learning, pages 409-416, Bari, Italy, 1996. Morgan Kaufmann.

Sag88. Y. Sagiv. Optimizing datalog programs. In J. Minker, editor, Foundations
of Deductive Databases and Logic Programming. Morgan Kaufmann, 1988.

SB86. C. Sammut and R. Banerji. Learning concepts by asking questions. In
R. Michalski, J. Carbonell, and T. Mitchell, editors, Machine Learning:
An AI Approach, Volume II. Morgan Kaufmann, 1986.

Sha83. E. Y. Shapiro. Algorithmic Program Debugging. MIT Press, Cambridge,
MA, 1983.

Sha91. E. Shapiro. Inductive inference of theories from facts. In J. Lassez and
G. Plotkin, editors, Computational Logic, pages 199-254. MIT Press, 1991.

Val84. L. G. Valiant. A theory of the learnable. Communications of the ACM,
27(11):1134-1142, 1984.

Val85. L. G. Valiant. Learning disjunctions of conjunctions. In Proceedings of the
International Joint Conference of Artificial Intelligence, pages 560-566, Los
Angeles, CA, 1985. Morgan Kaufmann.




On the Asymptotic Behavior of a Constant 
Stepsize Temporal-Difference Learning 
Algorithm 



Vladislav Tadic 

Mihajlo Pupin Institute, Volgina 15, 11000 Belgrade, Serbia, Yugoslavia 
etadicvSubbg . etf . bg . ac . yu 



Abstract. The mean-square asymptotic behavior of constant stepsize 
temporal-difference algorithms is analyzed in this paper. The analysis is 
carried out for the case of a linear (cost-to-go) function approximation 
and for the case of Markov chains with an uncountable state space. An 
asymptotic upper bound for the mean-square deviation of the algorithm 
iterations from the optimal value of the parameter of the (cost-to-go) 
function approximator achievable by temporal-difference learning is de- 
termined as a function of stepsize. 



Keywords. Temporal-difference learning, reinforcement learning, dy- 
namic programming, cost-to-go function approximation, Markov chains. 



1 Introduction 

Predicting the expected long-term future cost of a stochastic system modelled 
as an uncontrolled Markov chain is a problem of great importance in the fields 
such as time-series analysis and automatic control (see e.g., 0) These predic- 
tions have typically the form of a cost-to-go function, which itself is of a central 
role in the area of dynamic programming (see e.g., j2]). Several methods have 
been developed for predicting the values of a cost-to-go function associated to 
an uncontrolled Markov chain: Monte Carlo and maximum likelihood methods 
in the fields of statistics and automatic control (see e.g., |Zj) and temporal- 
difference learning in the area of machine learning (see e.g., |3|, POl)- Among 
them, temporal-difference is probably the most efficient and undoubtedly the 
most general and simplest to be implemented. Basically, temporal-difference 
learning algorithms are a parametric recursive method for approximating the 
cost-to-go function (associated to an uncontrolled Markov chain) as a function 
of the (chain) current state. Aiming at improving the approximations of the 
cost-to-go function on the basis of collected observations, these algorithms up- 
date the parameter of the function approximator whenever the observation of 
the chain transition and associated cost becomes available. 

Due to their good performance and wide range of applications, various properties
of temporal-difference learning algorithms have been considered extensively in
a great number of papers (see e.g., and references cited therein; see the
same references for details on the application of temporal-difference learning).
Their convergence properties (almost sure convergence, convergence in mean, 
convergence of mean and rate of convergence) have been studied in 0-0 and 
0 -IE]. However, the analysis presented in these papers corresponds exclusive- 
ly to decreasing stepsize temporal-difference algorithms. Since the decreasing 
stepsize algorithms typically exhibit a poor convergence rate, the analysis of the 
corresponding constant stepsize algorithms is much more interesting (at least 
from the application point of view) . 

In this paper, the mean-square asymptotic behavior of constant stepsize temporal- 
difference algorithms is analyzed. The analysis is carried out for the case of a 
linear (cost-to-go) function approximation and for the case of Markov chains with 
an uncountable state space. An asymptotic upper bound for the mean-square 
deviation of the algorithm iterations from the optimal value of the parameter of 
the (cost-to-go) function approximator achievable by temporal-difference learn- 
ing is determined as a function of stepsize. The results presented in this paper 
are an extension of the results of US] to the constant stepsize algorithms and 
to the case of Markov chains with an uncountable state space. Other problems
related to temporal-difference learning algorithms (almost sure convergence,
rate of convergence, non-linear function approximation and stabilization) are
considered in mi-ini 

2 Algorithm and Assumptions 

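Temporal-difference learning algorithms with function approximation are defined
by the following difference equations:

\[ \theta_{n+1} = \theta_n + \gamma\,\varepsilon_{n+1}\delta_{n+1}, \quad n \ge 0, \tag{1} \]
\[ \delta_{n+1} = g(X_{n+1}, X_{n+2}) + \alpha f(\theta_n, X_{n+2}) - f(\theta_n, X_{n+1}), \quad n \ge 0, \tag{2} \]
\[ \varepsilon_{n+1} = \sum_{i=0}^{n} (\alpha\lambda)^{n-i}\, \nabla_\theta f(\theta_n, X_{i+1}), \quad n \ge 0. \tag{3} \]

$f : R^d \times R^{d'} \to R$ and $g : R^{d'} \times R^{d'} \to R$ are Borel-measurable functions
and $f(\cdot, x)$, $x \in R^{d'}$, is differentiable, $\alpha \in (0,1)$, $\lambda \in [0,1]$ and $\gamma \in (0,\infty)$
are constants, while the algorithm initial value $\theta_0$ is an arbitrary vector from
$R^d$. $\{X_n\}_{n \ge 0}$ is an $R^{d'}$-valued random process defined on a probability space
$(\Omega, \mathcal{F}, \mathcal{P})$.

Let $f^*(x) = E\left(\sum_{n=0}^{\infty} \alpha^n g(X_n, X_{n+1}) \mid X_0 = x\right)$, $x \in R^{d'}$, be a discounted
cost-to-go function associated to $\{X_n\}_{n \ge 0}$. The algorithm (1) - (3) aims at determining
the parameter $\theta \in R^d$ such that $f(\theta, \cdot)$ approximates $f^*(\cdot)$. If $\lambda = 1$, this reduces
to determining $\theta \in R^d$ such that $f(\theta, \cdot)$ approximates $f^*(\cdot)$ optimally in the $L^2(\mu)$
sense, i.e., to the minimization of the criterion function
$J^*(\theta) = \int (f(\theta, x) - f^*(x))^2 \mu(dx)$, $\theta \in R^d$ (for details see fS]).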

In this paper, the algorithm (1) - (3) is analyzed for the following case:

(i) $\{X_n\}_{n \ge 0}$ is a homogeneous Markov chain,

(ii) $f(\theta, x) = \theta^T \phi(x)$, $\forall \theta \in R^d$, $\forall x \in R^{d'}$, where $\phi : R^{d'} \to R^d$ is a
Borel-measurable function.
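
The recursions (1) - (3) in the linear case (ii) can be sketched as follows. This is only an illustration under stated assumptions: the chain is replaced by a finite-state one (the paper treats an uncountable state space), and the function name `td_lambda_constant_step`, the transition matrix `P`, the cost table `g` and the feature matrix `phi` are all hypothetical. For linear $f$ the gradient in (3) is $\phi(X_{i+1})$, so the eligibility vector satisfies $\varepsilon_{n+1} = \alpha\lambda\,\varepsilon_n + \phi(X_{n+1})$ with $\varepsilon_0 = 0$.

```python
import numpy as np

def td_lambda_constant_step(P, g, phi, alpha, lam, gamma, n_steps, seed=0):
    """Run recursions (1)-(3) with f(theta, x) = theta^T phi(x) on a finite chain.

    P   : (m, m) transition matrix (finite-state stand-in, an assumption)
    g   : (m, m) one-step costs g(x, x')
    phi : (m, d) feature matrix, row i is phi(x_i)
    """
    rng = np.random.default_rng(seed)
    m, d = phi.shape
    theta = np.zeros(d)                    # theta_0 (arbitrary; zero here)
    eps = np.zeros(d)                      # eligibility vector, epsilon_0 = 0
    x = rng.integers(m)                    # plays the role of X_{n+1}
    for _ in range(n_steps):
        x_next = rng.choice(m, p=P[x])     # plays the role of X_{n+2}
        eps = alpha * lam * eps + phi[x]   # (3) in recursive form
        delta = g[x, x_next] + alpha * (phi[x_next] @ theta) - phi[x] @ theta  # (2)
        theta = theta + gamma * eps * delta                                    # (1)
        x = x_next
    return theta

# Hypothetical two-state example with tabular features.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
g = np.array([[1.0, 0.0], [0.0, 2.0]])
phi = np.eye(2)
theta = td_lambda_constant_step(P, g, phi, alpha=0.9, lam=0.5,
                                gamma=0.01, n_steps=5000)
```

With a constant stepsize the iterates do not converge to a point; they keep fluctuating around a limit value, which is why the analysis is stated in the mean-square sense.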

Let $R^+$ and $R_0^+$ be the sets of positive and non-negative reals (respectively),
while $\|\cdot\|$ denotes the Euclidean vector norm, the Frobenius matrix norm and
the total variation of signed measures. Let $P(x, \cdot)$, $x \in R^{d'}$, be the transition
probability of $\{X_n\}_{n \ge 0}$ (i.e., $\mathcal{P}(X_{n+1} \in B \mid X_n) = P(X_n, B)$ w.p.1, $\forall B \in \mathcal{B}^{d'}$,
$n \ge 0$), while $\bar{g}(x) = \int g(x, x') P(x, dx')$, $x \in R^{d'}$. For $x \in R^{d'}$ and $B \in \mathcal{B}^{d'}$, let
$P_0(x, B) = I_B(x)$ and $P_{n+1}(x, B) = \int P_n(x', B) P(x, dx')$, $n \ge 0$. For $x \in R^{d'}$,
let $(P_n \phi)(x) = \int \phi(x') P_n(x, dx')$, $(P_n \bar{g})(x) = \int \bar{g}(x') P_n(x, dx')$ and
$\phi_n(x) = \alpha (P_{n+1}\phi)(x) - (P_n \phi)(x)$, $n \ge 0$. The assumptions under which the algorithm (1)
- (3) is analyzed are as follows:

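A1. $g(\cdot, \cdot)$ and $\phi(\cdot)$ are locally bounded.

A2. $\{X_n\}_{n \ge 0}$ has a unique invariant measure $\mu(\cdot)$.

A3. There exists a constant $r \in R^+$ such that $\mu(x : \|x\| > r) = 0$ and
$P(x, x' : \|x'\| > r) = 0$, $\forall x \in R^{d'}$.

A4. There exists a locally bounded Borel-measurable function $\psi : R^{d'} \to R_0^+$
such that

\[ \sum_{i=0}^{\infty} \left\| \int \phi(x') (P_n \bar{g})(x') (P_i - \mu)(x, dx') \right\| \le \psi(x), \quad \forall x \in R^{d'},\ n \ge 0, \]

\[ \sum_{i=0}^{\infty} \left\| \int \phi(x') (P_n \phi^T)(x') (P_i - \mu)(x, dx') \right\| \le \psi(x), \quad \forall x \in R^{d'},\ n \ge 0. \]
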

A5. $\int \phi(x) \phi^T(x) \mu(dx)$ is positive definite.

Remark 1. A1 and A3 imply

\[ P_n(x, x' : \|x'\| > r) = 0, \quad \forall x \in R^{d'},\ n \ge 1, \tag{4} \]

\[ \sup_{\|x\| \le t} \int |g(x, x')| P(x, dx') \le \sup_{\|x\| \le t,\, \|x'\| \le r} |g(x, x')| < \infty, \quad \forall t \in R^+, \tag{5} \]

\[ \sup_{x \in R^{d'}} \int\!\!\int |g(x', x'')| P(x', dx'') P_n(x, dx') \le \sup_{\|x\|, \|x'\| \le r} |g(x, x')| < \infty, \quad n \ge 1, \tag{6} \]

\[ \sup_{x \in R^{d'}} \int \|\phi(x')\| P_n(x, dx') \le \sup_{\|x\| \le r} \|\phi(x)\| < \infty, \quad n \ge 1, \tag{7} \]

wherefrom it can easily be deduced that $\bar{g}(\cdot)$ and the terms of the sums appearing
in A4 are well-defined and finite.
A1 corresponds to the properties of $g(\cdot, \cdot)$ and $\phi(\cdot)$, while A2 - A5 refer to those
of $\{X_n\}_{n \ge 0}$. A1 represents probably the mildest assumption on $g(\cdot, \cdot)$ and $\phi(\cdot)$
under which the algorithm (1) - (3) can still be analyzed. On the other hand,
A2 requires $\{X_n\}_{n \ge 0}$ to exhibit a certain asymptotic stationarity, while A4
virtually demands this “asymptotic steady-state” to be reached sufficiently fast.
Assumptions of this kind are natural for learning in Markovian environments
and standardly appear in the analysis of the corresponding (learning)
algorithms (for details see [...] and references cited therein). A3 requires $\{X_n\}_{n \ge 0}$
to be essentially bounded. Since any implementation of the algorithm (1) - (3)
involves a uniform truncation of the input data (due to the finite precision of
implementing machines), this requirement is not restrictive (at least not from
the application point of view). A5 can be thought of as a form of “persistency
of excitation” condition (for details see e.g., [...]). If $\{X_n\}_{n \ge 0}$ has a finite
state-space $\{x_1, \ldots, x_m\}$, it reduces to the requirement that $\mu(x = x_i) > 0$,
$1 \le i \le m$, and that $[\phi(x_1) \ldots \phi(x_m)]$ is a full row-rank matrix (i.e., that its
rows are linearly independent).
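
The finite-state reduction of A5 can be checked numerically. The sketch below is illustrative only: the two-state chain and both feature matrices are made up, and the helper names `stationary_distribution` and `excitation_matrix` are hypothetical. It computes $\sum_i \mu(x_i)\phi(x_i)\phi^T(x_i)$ and looks at its smallest eigenvalue, which is strictly positive exactly when every state has positive invariant mass and the feature vectors span $R^d$.

```python
import numpy as np

def stationary_distribution(P):
    """Invariant distribution of an irreducible finite chain
    (left eigenvector of P for the eigenvalue 1, normalized to sum to 1)."""
    vals, vecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(vals - 1.0))
    mu = np.real(vecs[:, k])
    return mu / mu.sum()

def excitation_matrix(P, phi):
    """sum_i mu(x_i) phi(x_i) phi(x_i)^T, where row i of phi is phi(x_i)^T."""
    mu = stationary_distribution(P)
    return phi.T @ (mu[:, None] * phi)

P = np.array([[0.9, 0.1], [0.2, 0.8]])
phi_full = np.array([[1.0, 0.0], [1.0, 1.0]])       # feature vectors span R^2
phi_deficient = np.array([[1.0, 2.0], [2.0, 4.0]])  # phi(x_2) = 2 phi(x_1)

min_eig_full = np.linalg.eigvalsh(excitation_matrix(P, phi_full)).min()
min_eig_def = np.linalg.eigvalsh(excitation_matrix(P, phi_deficient)).min()
```

In the second case the matrix has rank one, so its smallest eigenvalue is zero up to rounding and the condition fails.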

Using A3 and (4), it can easily be deduced that

\[ \left\| \int \phi(x') (P_n \phi^T)(x') (P_i - \mu)(x, dx') \right\| \le C^2 \|(P_i - \mu)(x, \cdot)\|, \quad \forall x \in R^{d'},\ n \ge 0,\ i \ge 1, \]

\[ \left\| \int \phi(x') (P_n \bar{g})(x') (P_i - \mu)(x, dx') \right\| \le C^2 \|(P_i - \mu)(x, \cdot)\|, \quad \forall x \in R^{d'},\ n \ge 0,\ i \ge 1, \]

where $C = \sup_{\|x\| \le r} \max\{|\bar{g}(x)|, \|\phi(x)\|\}$. Then, it is obvious that A4 holds if A2
and A3 are satisfied and if $\sum_{i=0}^{\infty} \|(P_i - \mu)(x, \cdot)\| < \infty$, $\forall x \in R^{d'}$. Consequently,
A2 - A4 are satisfied if $\{X_n\}_{n \ge 0}$ is a geometrically ergodic Markov chain
with bounded state-space, which itself includes the case of an irreducible aperiodic
Markov chain with finite state-space (see e.g. [...]). Moreover, A1 - A5 with the
exception of only A3 cover the assumptions adopted in [...].

3 Analysis 

For $x, x' \in R^{d'}$, $y \in R^d$ and $z = (x, x', y)$, let $U(z) = y(\alpha\phi(x') - \phi(x))^T$ and
$u(z) = y\, g(x, x')$, while

\[ \Pi(z, B) = \int I_B(x', x'', \alpha\lambda y + \phi(x')) P(x', dx''), \quad \forall B \in \mathcal{B}^{d+2d'}. \]

Let $Z_n = (X_n, X_{n+1}, \varepsilon_n)$, $n \ge 0$, while $\theta_* = -U_*^{-1} u_*$ and

\[ U_* = -\int \phi(x)\phi^T(x)\mu(dx) + (1 - \lambda)\alpha \sum_{n=0}^{\infty} (\alpha\lambda)^n \int \phi(x)(P_{n+1}\phi^T)(x)\mu(dx), \]

\[ u_* = \sum_{n=0}^{\infty} (\alpha\lambda)^n \int \phi(x)(P_n \bar{g})(x)\mu(dx). \]

Then, it can easily be deduced that $\{Z_n\}_{n \ge 0}$ is a homogeneous Markov process
satisfying $\mathcal{P}(Z_{n+1} \in B \mid Z_n) = \Pi(Z_n, B)$ w.p.1, $\forall B \in \mathcal{B}^{d+2d'}$, $n \ge 0$. For an
interpretation of $\theta_*$ and an upper bound for the error in approximating $f_*(\cdot)$ by
$f(\theta_*, \cdot)$ see [...].

Lemma 1. Let A1 - A3 and A5 hold. Then, $f_*(\cdot)$, $U_*$, $u_*$ and $\theta_*$ are well-defined
and finite.

Proof. Using (4) - (7), it can easily be deduced that there exists a locally bounded
Borel-measurable function $\varphi : R^{d'} \to R_0^+$ such that

\[ \max\{|(P_n \bar{g})(x)|, \|(P_n \phi)(x)\|\} \le \varphi(x), \quad \forall x \in R^{d'},\ n \ge 0. \tag{8} \]


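Let $K = \sup_{\|x\| \le r} \varphi(x)$. As

\[ E\left( \sum_{n=0}^{\infty} \alpha^n |g(X_n, X_{n+1})| \,\Big|\, X_0 = x \right) = \sum_{n=0}^{\infty} \alpha^n \int\!\!\int |g(x', x'')| P(x', dx'') P_n(x, dx'), \quad \forall x \in R^{d'}, \]

\[ \int \|\phi(x)\phi^T(x)\| \mu(dx) \le K^2, \]

\[ \sum_{n=0}^{\infty} (\alpha\lambda)^n \int \|\phi(x)(P_{n+1}\phi^T)(x)\| \mu(dx) \le (1 - \alpha\lambda)^{-1} K^2, \]

\[ \sum_{n=0}^{\infty} (\alpha\lambda)^n \int \|\phi(x)(P_n \bar{g})(x)\| \mu(dx) \le (1 - \alpha\lambda)^{-1} K^2 \]

(due to A3 and (8)), it is obvious that $f_*(\cdot)$, $U_*$ and $u_*$ are also well-defined and finite.
On the other hand, owing to the Lyapunov inequality, it is obtained

\[ \int (\theta^T (P_n \phi)(x))^2 \mu(dx) \le \int\!\!\int (\theta^T \phi(x'))^2 P_n(x, dx') \mu(dx) = \int (\theta^T \phi(x))^2 \mu(dx), \quad \forall \theta \in R^d,\ n \ge 0. \]

Consequently,

\[ \left| \int \theta^T \phi(x) (P_n \phi^T)(x) \theta\, \mu(dx) \right| \le \left( \int (\theta^T \phi(x))^2 \mu(dx) \right)^{1/2} \left( \int (\theta^T (P_n \phi)(x))^2 \mu(dx) \right)^{1/2} \le \int (\theta^T \phi(x))^2 \mu(dx), \quad \forall \theta \in R^d,\ n \ge 0. \]

Then, it can easily be deduced that

\[ \theta^T U_* \theta = (1 - \lambda)\alpha \sum_{n=0}^{\infty} (\alpha\lambda)^n \int \theta^T \phi(x) (P_{n+1}\phi^T)(x) \theta\, \mu(dx) - \int (\theta^T \phi(x))^2 \mu(dx) \]
\[ \le -(1 - \alpha)(1 - \alpha\lambda)^{-1} \int (\theta^T \phi(x))^2 \mu(dx) = -(1 - \alpha)(1 - \alpha\lambda)^{-1}\, \theta^T \left( \int \phi(x)\phi^T(x)\mu(dx) \right) \theta, \quad \forall \theta \in R^d, \]

wherefrom it immediately follows that $U_*$ is negative definite and that $\theta_*$ is
well-defined and finite.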
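
For a finite-state chain the quantities $U_*$, $u_*$ and $\theta_*$ of Lemma 1 become matrix expressions and can be evaluated directly. The following sketch makes that concrete under assumptions that are not in the paper: a finite state space, truncation of the infinite series after `n_terms` terms, and hypothetical names (`td_limit_quantities`, `P`, `g`, `phi`). Writing $\Phi$ for the matrix with rows $\phi^T(x_i)$ and $D = \mathrm{diag}(\mu)$, the integral $\int \phi (P_n \phi^T)\, d\mu$ becomes $\Phi^T D P^n \Phi$.

```python
import numpy as np

def td_limit_quantities(P, g, phi, alpha, lam, n_terms=2000):
    """Truncated-series evaluation of U_*, u_* and theta_* = -U_*^{-1} u_*
    for a finite chain (illustrative stand-in for the general state space)."""
    m, d = phi.shape
    vals, vecs = np.linalg.eig(P.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    mu = mu / mu.sum()                       # invariant distribution
    D = np.diag(mu)
    g_bar = (P * g).sum(axis=1)              # g_bar(x) = sum_x' g(x, x') P(x, x')
    U = -phi.T @ D @ phi
    u = np.zeros(d)
    Pn = np.eye(m)                           # P_0 = identity
    for n in range(n_terms):
        w = (alpha * lam) ** n
        u = u + w * (phi.T @ D @ Pn @ g_bar)
        Pn = Pn @ P                          # advance to P_{n+1}
        U = U + (1 - lam) * alpha * w * (phi.T @ D @ Pn @ phi)
    theta = -np.linalg.solve(U, u)
    return U, u, theta

P = np.array([[0.9, 0.1], [0.2, 0.8]])
g = np.array([[1.0, 0.0], [0.0, 2.0]])
phi = np.array([[1.0, 0.0], [1.0, 1.0]])
U, u, theta_star = td_limit_quantities(P, g, phi, alpha=0.9, lam=0.5)

# Sanity check with tabular features and lambda = 1: theta_* is then the
# exact cost-to-go vector (I - alpha P)^{-1} g_bar.
_, _, theta_tab = td_limit_quantities(P, g, np.eye(2), alpha=0.9, lam=1.0)
f_star = np.linalg.solve(np.eye(2) - 0.9 * P, (P * g).sum(axis=1))
```

The symmetric part of `U` should have only negative eigenvalues, in line with the negative definiteness established in the proof.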



Lemma 2. Let A1 - A3 hold. Then, $(\Pi^n U)(\cdot)$ and $(\Pi^n u)(\cdot)$ are well-defined,
finite and satisfy the following relations for all $x, x' \in R^{d'}$, $y \in R^d$ and $z =
(x, x', y)$:

\[ (\Pi^{n+1} U)(z) = \sum_{i=0}^{n} (\alpha\lambda)^i \int \phi(x'')\phi_i^T(x'') P_{n-i}(x', dx'') + (\alpha\lambda)^{n+1} y\, \phi_n^T(x'), \quad n \ge 0, \tag{9} \]

\[ (\Pi^{n+1} u)(z) = \sum_{i=0}^{n} (\alpha\lambda)^i \int \phi(x'')(P_i \bar{g})(x'') P_{n-i}(x', dx'') + (\alpha\lambda)^{n+1} y\, (P_n \bar{g})(x'), \quad n \ge 0. \tag{10} \]

Proof. Due to A1, A3 and (4) - (7), it is obvious that $(\Pi^n U)(z)$, $(\Pi^n u)(z)$ and
the right-hand sides of (9) and (10) are well-defined and finite for all $x, x' \in R^{d'}$,
$y \in R^d$, $z = (x, x', y)$ and $n \ge 0$ (notice that A1 and A3 imply that $U(\cdot)$ and $u(\cdot)$
are locally bounded and that $\Pi^n(z, \cdot)$ is compactly supported for all $z \in R^{d+2d'}$,
$n \ge 0$). The relations (9) and (10) themselves will be proved by mathematical
induction.

It can easily be deduced from the definition of $\Pi(\cdot, \cdot)$ that the following relations
are satisfied for all $x, x' \in R^{d'}$, $y \in R^d$ and $z = (x, x', y)$:

\[ (\Pi U)(z) = \int (\alpha\lambda y + \phi(x'))(\alpha\phi(x'') - \phi(x'))^T P(x', dx'') = (\alpha\lambda y + \phi(x'))(\alpha (P_1 \phi)(x') - \phi(x'))^T \]
\[ = \int \phi(x'')\phi_0^T(x'') P_0(x', dx'') + \alpha\lambda y\, \phi_0^T(x'), \]

\[ (\Pi u)(z) = \int (\alpha\lambda y + \phi(x')) g(x', x'') P(x', dx'') = (\alpha\lambda y + \phi(x'))\, \bar{g}(x') \]
\[ = \alpha\lambda y\, \bar{g}(x') + \int \phi(x'')\bar{g}(x'') P_0(x', dx''). \]

Now, let us suppose that (9) and (10) hold for some $n \ge 0$ and for all $x, x' \in R^{d'}$,
$y \in R^d$ and $z = (x, x', y)$. Then, it can easily be deduced from the definition of
$\Pi(\cdot, \cdot)$ that the following relations are satisfied for all $x, x' \in R^{d'}$, $y \in R^d$ and
$z = (x, x', y)$:

\[ (\Pi^{n+2} U)(z) = \sum_{i=0}^{n} (\alpha\lambda)^i \int\!\!\int \phi(x''')\phi_i^T(x''') P_{n-i}(x'', dx''') P(x', dx'') + (\alpha\lambda)^{n+1} \int (\alpha\lambda y + \phi(x'))\phi_n^T(x'') P(x', dx'') \]
\[ = \sum_{i=0}^{n} (\alpha\lambda)^i \int \phi(x'')\phi_i^T(x'') P_{n-i+1}(x', dx'') + (\alpha\lambda)^{n+1} (\alpha\lambda y + \phi(x'))\phi_{n+1}^T(x') \]
\[ = \sum_{i=0}^{n+1} (\alpha\lambda)^i \int \phi(x'')\phi_i^T(x'') P_{n-i+1}(x', dx'') + (\alpha\lambda)^{n+2} y\, \phi_{n+1}^T(x'), \]

\[ (\Pi^{n+2} u)(z) = \sum_{i=0}^{n} (\alpha\lambda)^i \int\!\!\int \phi(x''')(P_i \bar{g})(x''') P_{n-i}(x'', dx''') P(x', dx'') + (\alpha\lambda)^{n+1} \int (\alpha\lambda y + \phi(x'))(P_n \bar{g})(x'') P(x', dx'') \]
\[ = \sum_{i=0}^{n} (\alpha\lambda)^i \int \phi(x'')(P_i \bar{g})(x'') P_{n-i+1}(x', dx'') + (\alpha\lambda)^{n+1} (\alpha\lambda y + \phi(x'))(P_{n+1} \bar{g})(x') \]
\[ = \sum_{i=0}^{n+1} (\alpha\lambda)^i \int \phi(x'')(P_i \bar{g})(x'') P_{n-i+1}(x', dx'') + (\alpha\lambda)^{n+2} y\, (P_{n+1} \bar{g})(x'). \]

This completes the proof.

Lemma 3. Let A1 - A4 hold. Then, there exist locally bounded Borel-measurable
functions $V : R^{d+2d'} \to R^{d \times d}$ and $v : R^{d+2d'} \to R^d$ such that $(\Pi V)(\cdot)$ and
$(\Pi v)(\cdot)$ are also locally bounded and such that

\[ U(z) - U_* = V(z) - (\Pi V)(z), \quad \forall z \in R^{d+2d'}, \tag{11} \]

\[ u(z) - u_* = v(z) - (\Pi v)(z), \quad \forall z \in R^{d+2d'}. \tag{12} \]

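Proof. It can easily be deduced from (4) - (7) that there exists a locally bounded
Borel-measurable function $\varphi : R^{d'} \to R_0^+$ such that

\[ \max\{\|(P_n \phi)(x)\|, |(P_n \bar{g})(x)|\} \le \varphi(x), \quad \forall x \in R^{d'},\ n \ge 0. \tag{13} \]

Let $K = \sup_{\|x\| \le r} \varphi(x)$. Then, it follows from A3 and (13) that

\[ \sum_{i=n+1}^{\infty} (\alpha\lambda)^i \int \|\phi(x)(P_{i+j}\phi^T)(x)\| \mu(dx) \le (1 - \alpha\lambda)^{-1} (\alpha\lambda)^{n+1} K^2, \quad n, j \ge 0, \tag{14} \]

\[ \sum_{i=n+1}^{\infty} (\alpha\lambda)^i \int \|\phi(x)(P_i \bar{g})(x)\| \mu(dx) \le (1 - \alpha\lambda)^{-1} (\alpha\lambda)^{n+1} K^2, \quad n \ge 0, \tag{15} \]

while A4 implies

\[ \sum_{n=0}^{\infty} \sum_{i=0}^{n} (\alpha\lambda)^i \left\| \int \phi(x')(P_{i+j}\phi^T)(x')(P_{n-i} - \mu)(x, dx') \right\| = \sum_{i=0}^{\infty} (\alpha\lambda)^i \sum_{n=0}^{\infty} \left\| \int \phi(x')(P_{i+j}\phi^T)(x')(P_n - \mu)(x, dx') \right\| \]
\[ \le (1 - \alpha\lambda)^{-1} \psi(x), \quad \forall x \in R^{d'},\ j \ge 0, \tag{16} \]

\[ \sum_{n=0}^{\infty} \sum_{i=0}^{n} (\alpha\lambda)^i \left\| \int \phi(x')(P_i \bar{g})(x')(P_{n-i} - \mu)(x, dx') \right\| = \sum_{i=0}^{\infty} (\alpha\lambda)^i \sum_{n=0}^{\infty} \left\| \int \phi(x')(P_i \bar{g})(x')(P_n - \mu)(x, dx') \right\| \]
\[ \le (1 - \alpha\lambda)^{-1} \psi(x), \quad \forall x \in R^{d'}. \]

On the other hand, using Lemma 2 it is obtained

\[ (\Pi^{n+1} U)(z) - U_* = \sum_{i=0}^{n} (\alpha\lambda)^i \int \phi(x'')\phi_i^T(x'')(P_{n-i} - \mu)(x', dx'') - \sum_{i=n+1}^{\infty} (\alpha\lambda)^i \int \phi(x'')\phi_i^T(x'')\mu(dx'') + (\alpha\lambda)^{n+1} y\, \phi_n^T(x'), \quad n \ge 0, \tag{17} \]

\[ (\Pi^{n+1} u)(z) - u_* = \sum_{i=0}^{n} (\alpha\lambda)^i \int \phi(x'')(P_i \bar{g})(x'')(P_{n-i} - \mu)(x', dx'') - \sum_{i=n+1}^{\infty} (\alpha\lambda)^i \int \phi(x'')(P_i \bar{g})(x'')\mu(dx'') + (\alpha\lambda)^{n+1} y\, (P_n \bar{g})(x'), \quad n \ge 0, \]

for all $x, x' \in R^{d'}$, $y \in R^d$ and $z = (x, x', y)$. Then, owing to (13) - (15), it
can easily be deduced that the following relations also hold for all $x, x' \in R^{d'}$,
$y \in R^d$ and $z = (x, x', y)$:

\[ \|(\Pi^{n+1} U)(z) - U_*\| \le \sum_{i=0}^{n} (\alpha\lambda)^i \left\| \int \phi(x'')(P_{i+1}\phi^T)(x'')(P_{n-i} - \mu)(x', dx'') \right\| + \sum_{i=0}^{n} (\alpha\lambda)^i \left\| \int \phi(x'')(P_i \phi^T)(x'')(P_{n-i} - \mu)(x', dx'') \right\| \]
\[ + 2(\alpha\lambda)^{n+1} \varphi(x') \|y\| + 2(\alpha\lambda)^{n+1} (1 - \alpha\lambda)^{-1} K^2, \quad n \ge 0, \]

\[ \|(\Pi^{n+1} u)(z) - u_*\| \le \sum_{i=0}^{n} (\alpha\lambda)^i \left\| \int \phi(x'')(P_i \bar{g})(x'')(P_{n-i} - \mu)(x', dx'') \right\| + (\alpha\lambda)^{n+1} \varphi(x') \|y\| + (\alpha\lambda)^{n+1} (1 - \alpha\lambda)^{-1} K^2, \quad n \ge 0. \]

Therefore and owing to (16) and (17), it is obtained that

\[ \sum_{n=1}^{\infty} \|(\Pi^n U)(z) - U_*\| \le 2(1 - \alpha\lambda)^{-1} \left( \alpha\lambda (1 - \alpha\lambda)^{-1} K^2 + \alpha\lambda \varphi(x') \|y\| + \psi(x') \right), \]

\[ \sum_{n=1}^{\infty} \|(\Pi^n u)(z) - u_*\| \le (1 - \alpha\lambda)^{-1} \left( \alpha\lambda (1 - \alpha\lambda)^{-1} K^2 + \alpha\lambda \varphi(x') \|y\| + \psi(x') \right) \]

for all $x, x' \in R^{d'}$, $y \in R^d$ and $z = (x, x', y)$. Let $V(z) = \sum_{n=0}^{\infty} ((\Pi^n U)(z) - U_*)$
and $v(z) = \sum_{n=0}^{\infty} ((\Pi^n u)(z) - u_*)$, $z \in R^{d+2d'}$. Then, it can easily be deduced
that $V(\cdot)$, $(\Pi V)(\cdot)$, $v(\cdot)$ and $(\Pi v)(\cdot)$ are well-defined, locally bounded and satisfy
(11) and (12) (notice that $(\Pi V)(z) = \sum_{n=1}^{\infty} ((\Pi^n U)(z) - U_*)$ and $(\Pi v)(z) =
\sum_{n=1}^{\infty} ((\Pi^n u)(z) - u_*)$, $\forall z \in R^{d+2d'}$).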

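Theorem 1. Let A1 - A5 hold and let $\theta_0$ be an arbitrary deterministic vector
from $R^d$. Then, there exist constants $\gamma_*, C_* \in R^+$ (depending on $\alpha$, $\lambda$, $r$,
$g(\cdot, \cdot)$ and $\phi(\cdot)$, and not depending on $\gamma$ and $\theta_0$) such that

\[ \limsup_{n \to \infty} E\|\theta_n - \theta_*\|^2 \le C_* \gamma, \quad \forall \gamma \in (0, \gamma_*). \]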

Proof. Let $-\lambda_m$ denote the largest eigenvalue of $U_*$, while $K = \sup_{\|x\| \le r} \|\phi(x)\|$.
Let

\[ L = \sup\{\|U(z)\|, \|u(z)\|, \|V(z)\|, \|v(z)\|, \|(\Pi V)(z)\|, \|(\Pi v)(z)\| : \|z\| \le 2r + (1 - \alpha\lambda)^{-1} K\} \]

and M = {1 + L)(l -|- ||0*||). Let 7 * = min{2A“^, 2“^M“^} and suppose that 
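$\gamma \in (0, \gamma_*)$. Let $\mathcal{F}_n = \sigma\{Z_0, \ldots, Z_n\}$, $A_n = U(Z_n)$, $B_n = V(Z_n)$ and $\bar{B}_n =
(\Pi V)(Z_n)$, $n \ge 0$, while $a_n = U(Z_n)\theta_* + u(Z_n)$, $b_n = V(Z_n)\theta_* + v(Z_n)$ and
$\bar{b}_n = (\Pi V)(Z_n)\theta_* + (\Pi v)(Z_n)$, $n \ge 0$. Since $\|X_n\| \le r$ w.p.1, $n \ge 1$ (due to A3),
it is obtained

\[ \|\varepsilon_{n+1}\| \le \sum_{i=0}^{n} (\alpha\lambda)^{n-i} \|\phi(X_{i+1})\| \le (1 - \alpha\lambda)^{-1} K \quad \text{w.p.1},\ n \ge 0. \]

Consequently, $\|Z_n\| \le 2r + (1 - \alpha\lambda)^{-1} K$ w.p.1, $n \ge 1$, wherefrom

\[ \max\{\|A_n\|, \|a_n\|, \|B_n\|, \|b_n\|, \|\bar{B}_n\|, \|\bar{b}_n\|\} \le M \quad \text{w.p.1},\ n \ge 1, \tag{18} \]

immediately follows. Let $x_n = \theta_n - \theta_*$ and $T_n = \|x_n\|^2 + 2\gamma x_n^T \bar{B}_n x_n + 2\gamma x_n^T \bar{b}_n$,
$n \ge 0$. Then, it is straightforward to verify that

\[ x_{n+1} = x_n + \gamma (A_{n+1} x_n + a_{n+1}), \quad n \ge 0, \tag{19} \]

\[ T_{n+1} = T_n + 2\gamma x_n^T U_* x_n + 2\gamma x_n^T ((B_{n+1} - \bar{B}_n) x_n + b_{n+1} - \bar{b}_n) + \gamma^2 \|A_{n+1} x_n + a_{n+1}\|^2 \]
\[ + 2\gamma^2 ((\bar{B}_{n+1} + \bar{B}_{n+1}^T) x_n + \bar{b}_{n+1})^T (A_{n+1} x_n + a_{n+1}) + 2\gamma^3 (A_{n+1} x_n + a_{n+1})^T \bar{B}_{n+1} (A_{n+1} x_n + a_{n+1}), \quad n \ge 0. \tag{20} \]

Using (18), (19) and the fact that $\theta_0$ is deterministic, it can easily be deduced
that $\sigma\{\theta_n\} \subseteq \mathcal{F}_n$, $n \ge 0$, and

\[ \|x_n\| \le (1 + \gamma M)^n (\|\theta_0\| + \|\theta_*\|) + \gamma M \sum_{i=1}^{n} (1 + \gamma M)^{n-i}, \quad n \ge 0. \tag{21} \]

Therefore and owing to the fact that $E(B_{n+1} \mid \mathcal{F}_n) = \bar{B}_n$ and $E(b_{n+1} \mid \mathcal{F}_n) = \bar{b}_n$
w.p.1, $n \ge 0$ (notice that $E(V(Z_{n+1}) \mid \mathcal{F}_n) = (\Pi V)(Z_n)$ and $E(v(Z_{n+1}) \mid \mathcal{F}_n) =
(\Pi v)(Z_n)$ w.p.1, $n \ge 0$), it is obvious that

\[ E(x_n^T (B_{n+1} - \bar{B}_n) x_n \mid \mathcal{F}_n) = 0 \quad \text{w.p.1},\ n \ge 0, \]

\[ E(x_n^T (b_{n+1} - \bar{b}_n) \mid \mathcal{F}_n) = 0 \quad \text{w.p.1},\ n \ge 0. \]

Consequently,

\[ E(x_n^T ((B_{n+1} - \bar{B}_n) x_n + b_{n+1} - \bar{b}_n)) = 0, \quad n \ge 0. \tag{22} \]

On the other hand, (18) implies

\[ \|A_{n+1} x_n + a_{n+1}\|^2 \le 2M^2 (1 + \|x_n\|^2), \quad n \ge 0, \tag{23} \]

\[ |(A_{n+1} x_n + a_{n+1})^T \bar{B}_{n+1} (A_{n+1} x_n + a_{n+1})| \le 2M^3 (1 + \|x_n\|^2), \quad n \ge 0, \tag{24} \]

\[ |((\bar{B}_{n+1} + \bar{B}_{n+1}^T) x_n + \bar{b}_{n+1})^T (A_{n+1} x_n + a_{n+1})| \le M^2 (1 + 2\|x_n\|)(1 + \|x_n\|) \le 4M^2 (1 + \|x_n\|^2), \quad n \ge 0, \tag{25} \]

\[ |T_n - \|x_n\|^2| \le 2M\gamma \|x_n\|^2 + 2M\gamma \|x_n\| \le 4M\gamma \|x_n\|^2 + 2M\gamma \le 2^{-1} \|x_n\|^2 + 2M\gamma, \quad n \ge 1 \tag{26} \]

(notice that $4M\gamma \le 2^{-1}$). Since $x_n^T U_* x_n \le -\lambda_m \|x_n\|^2$, $n \ge 0$, it directly follows
from (20) and (22) - (25) that

\[ E(T_{n+1}) \le E(T_n) - 2\lambda_m \gamma E\|x_n\|^2 + 16 M^3 \gamma^2 (1 + E\|x_n\|^2) \le E(T_n) - \lambda_m \gamma E\|x_n\|^2 + 16 M^3 \gamma^2, \quad n \ge 1 \tag{27} \]

(notice that $\gamma < 1$, $\lambda_m \ge 16 M^3 \gamma$ and that (21) and (26) imply $E|T_n| < \infty$,
$n \ge 1$), while (26) yields

\[ E\|x_n\|^2 \ge 2^{-1} E(T_n) - M\gamma, \quad n \ge 1, \tag{28} \]

\[ E\|x_n\|^2 \le 2 E(T_n) + 4M\gamma, \quad n \ge 1. \tag{29} \]

Due to (27) and (28), it is obtained

\[ E(T_{n+1}) \le (1 - 2^{-1} \lambda_m \gamma) E(T_n) + M(\lambda_m + 16 M^2) \gamma^2, \quad n \ge 1, \]

wherefrom

\[ E(T_n) \le (1 - 2^{-1} \lambda_m \gamma)^{n-1} E(T_1) + M(\lambda_m + 16 M^2) \gamma^2 \sum_{i=0}^{n-2} (1 - 2^{-1} \lambda_m \gamma)^i, \quad n \ge 1, \]

immediately follows. Therefore,

\[ \limsup_{n \to \infty} E(T_n) \le 2M(1 + 16 M^2 \lambda_m^{-1}) \gamma \]

(notice that $2^{-1} \lambda_m \gamma < 1$), which, together with (29), implies

\[ \limsup_{n \to \infty} E\|\theta_n - \theta_*\|^2 = \limsup_{n \to \infty} E\|x_n\|^2 \le 8M(1 + 8 M^2 \lambda_m^{-1}) \gamma. \]

This completes the proof.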
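
The last step of the proof rests on a scalar linear recursion, and its limit can be checked numerically. The sketch below iterates the equality case $t_{n+1} = (1 - 2^{-1}\lambda_m\gamma)\, t_n + M(\lambda_m + 16M^2)\gamma^2$ and compares the result with $2M(1 + 16M^2\lambda_m^{-1})\gamma$. The numerical values of `lam_m`, `M` and `gamma` are arbitrary illustrations, not constants from the paper, and `iterate_bound` is a hypothetical helper name.

```python
def iterate_bound(t1, lam_m, M, gamma, n_iter):
    """Iterate t <- (1 - lam_m*gamma/2) * t + M*(lam_m + 16*M**2)*gamma**2."""
    c = M * (lam_m + 16 * M**2) * gamma**2
    t = t1
    for _ in range(n_iter):
        t = (1 - 0.5 * lam_m * gamma) * t + c
    return t

lam_m, M, gamma = 0.5, 1.0, 1e-3        # illustrative values only
limit = iterate_bound(t1=10.0, lam_m=lam_m, M=M, gamma=gamma, n_iter=200000)
bound = 2 * M * (1 + 16 * M**2 / lam_m) * gamma
```

The fixed point of the recursion is the additive constant divided by $2^{-1}\lambda_m\gamma$, which is where the linear dependence of the final bound on the stepsize $\gamma$ comes from.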

4 Conclusion 

The mean-square asymptotic behavior of constant stepsize temporal-difference 
algorithms has been analyzed in this paper. The analysis has been carried out 
for the case of a linear (cost-to-go) function approximation and for the case of 
Markov chains with an uncountable state space. An asymptotic upper bound 
for the mean-square error deviation of the algorithm iterations from the optimal 
value of the parameter of the (cost-to-go) function approximator achievable by 
temporal-difference learning has been determined as a function of stepsize. The 
results presented in this paper are an extension of the results of [ig to the con- 
stant stepsize algorithms and to the case of Markov chains with an uncountable 
state space. 
Abstract. It is easy to design on-line learning algorithms for learning
$k$ out of $n$ variable monotone disjunctions by simply keeping one weight
per disjunction. Such algorithms use roughly $O(n^k)$ weights, which can
be prohibitively expensive. Surprisingly, algorithms like Winnow require
only $n$ weights (one per variable) and the mistake bound of these
algorithms is not too much worse than the mistake bound of the more
costly algorithms. The purpose of this paper is to investigate how the
exponentially many weights can be collapsed into only $O(n)$ weights. In
particular, we consider probabilistic assumptions that enable the Bayes
optimal algorithm’s posterior over the disjunctions to be encoded with
only $O(n)$ weights. This results in a new $O(n)$ algorithm for learning
disjunctions which is related to Bylander’s BEG algorithm, originally
introduced for linear regression. Besides providing a Bayesian
interpretation for this new algorithm, we are also able to obtain mistake bounds
for the noise-free case resembling those that have been derived for the
Winnow algorithm. The same techniques used to derive this new
algorithm also provide a Bayesian interpretation for a normalized version of
Winnow.



1 Introduction 

We consider the problem of learning k out of n variable monotone disjunctions, 
where k is typically much smaller than n, in an on-line setting. In this setting 
learning proceeds in a sequence of trials; on each trial the learning algorithm 
observes a boolean instance, predicts the instance’s classification, and then is 
told the correct classification for the instance. 

Most on-line learning algorithms use a set of weights or parameters to repre- 
sent their current hypothesis. In this paper on-line learning algorithms always 
have three parts: a prediction rule which maps the instance and weights to a 
prediction, an update function which specifies how the algorithm’s weights are 
modified, and an update policy indicating when the update function should be 

* The first and third authors are supported by NSF grant CCR 9700201. The second 
author is supported by a research fellowship from the University of Milan and by a 
Eurocolt grant. 



P. Fischer and H.U. Simon (Eds.): EuroCOLT’99, LNAI 1572, pp. 138-^^^ 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 
applied. The update policies considered in this paper are: 1) update after each
trial, and 2) only update after trials where the algorithm makes an incorrect
prediction. Algorithms with the latter policy are called mistake-driven (or con-
servative) |Lit89ILit^ . 

When learning monotone disjunctions, some algorithms keep one weight per
disjunction (i.e. a total of $\binom{n}{k}$ weights). We call such algorithms direct algorithms
since the weights directly encode the confidence in or likelihood of each individual
disjunction.

There are other algorithms that learn disjunctions while maintaining only $O(n)$
weights. We call such algorithms indirect algorithms since they indirectly encode
their confidences in the disjunctions using $O(1)$ weights per variable. Surprisingly,
these more efficient algorithms learn disjunctions almost as well as the direct
algorithms. The first such indirect algorithm was Littlestone’s Winnow algorithm
[Lit88, Lit89].

In this paper we are primarily interested in a performance criterion that makes
no probabilistic assumptions about how the data is generated. On the contrary 
the examples can be chosen by an adversary and the goal is to make relatively 
few mistakes compared to the number of mistakes made by the best monotone 
disjunction on the sequence of examples being observed |Lit88ILitl^ . 

The Bayesian approach is a popular way to design and analyze on-line algorithms.
Bayes learning algorithms use probabilistic assumptions about the world and
data observed in past trials to construct a posterior distribution over the class
of disjunctions. These algorithms then predict the most likely classification with
respect to the current posterior. It is well known that when the instances are
generated and labeled according to the probabilistic assumptions, the Bayes
algorithm minimizes the expected total number of mistakes.

By comparing the world model assumed by a Bayes algorithm to the actual situ- 
ation, one can get important intuition about how well (or poorly) the algorithm 
will perform. Relative mistake bounds give a much different kind of intuition, 
and their worst-case nature may be overly pessimistic. Relating these two styles 
of algorithms will give important insight into existing algorithms and lead to 
new approaches for designing learning algorithms. 

For many direct algorithms with good relative mistake bounds it is easy to 
work out a nice Bayesian interpretation for the algorithms’ prediction rule and 
update function by making appropriate probabilistic assumptions on how the 
data is generated. For instance, the direct Weighted Majority (WM) [LW94]
algorithm’s weights are posterior probabilities over the set of disjunctions of 
up to k variables under the assumption of i.i.d. label noise with a known rate. 
The algorithm predicts with the label having the highest posterior probability. 
Although the direct WM algorithm has a clean Bayesian interpretation, until 
now, it has been unclear if there also exists a Bayesian interpretation for the 
more efficient indirect algorithms which have good relative mistake bounds. 

In this paper we present a general technique for deriving indirect algorithms
from Bayes optimal algorithms that make certain probabilistic assumptions
about how the instances and labels are generated. In particular, we show that 
with some independence assumptions, the posterior distribution over monotone
disjunctions kept by the direct Bayes algorithm can be encoded with only $O(n)$
weights. These assumptions lead to indirect algorithms whose updates and
prediction functions have a clear Bayesian interpretation. Our technique has been
applied to derive two indirect algorithms whose updates and prediction functions
coincide with those used by Normalized Winnow$^1$ (first analyzed in [...]) and
a new classification variant of Bylander’s BEG algorithm$^2$ [Byl97] (two indirect
algorithms with good relative loss bounds). This suggests that there may be
more indirect algorithms that combine the strengths of the Bayesian and
relative mistake bound settings.
It is important to observe that the similarity between these algorithms does 
not extend to the update policy. All known indirect algorithms with good rel- 
ative mistake bounds must use the mistake-driven update policy, and all Bayes 
algorithms update their posteriors after each trial. 

The classical method for using independence assumptions to simplify the direct 
Bayes algorithm gives the indirect Naive Bayes algorithm. However, no relative 
loss bounds have been proven for Naive Bayes or its mistake-driven variant. 
The mistake-driven variant has performed better in experiments, but both ver- 
sions are very sensitive to redundant attributes and neither performs as well as 
Winnow [Lit95].

The precursor of this research is a paper by Nick Littlestone [Lit96] (see
also [LM97]) in which he uses a Bayesian approach to derive an indirect
prediction algorithm, the Singly Variant Bayes algorithm (SVB), for learning linearly
separable functions (which include disjunctions). Rather than using a prior over
the set of all monotone disjunctions, the SVB algorithm uses a uniform prior
over the set of disjunctions of size one. This leads to a different style of
indirect update than the ones considered in this paper. A good mistake bound for
learning disjunctions with SVB has been proven only for the noise-free case, and
Winnow’s bound is much better when learning disjunctions.

The next two sections review the on-line learning of disjunctions and the direct 
Bayes algorithm. Our general technique for deriving indirect algorithms from 
direct Bayes algorithms is presented in Section 4. To keep the presentation as 
simple as possible, we specialize the presentation to derive the linear threshold 
classification algorithm related to Bylander’s BEG algorithm [Byl97]. In Section
5 we briefly describe how the same technique can be applied to obtain a Bayesian 
interpretation of the normalized variant of Winnow. 



^ Normalized Winnow is identical to Winnow except that for computing its linear 
threshold predictions it uses the normalized instead of the un-normalized weights. 

^ Throughout this paper we call the algorithm using the update function of Figure H 
BEG because it is related to the update used by Bylander’s Binary Exponentiated 
Gradient algorithm for linear regression. When the gradient w.r.t. the square 

loss used in the derivation of Bylander’s algorithm is replaced by the gradient w.r.t. 
the “linear hinge loss”, we get the update function of Figure Q This “linear hinge 
loss” can be used to motivate other linear threshold classification algorithms such 
as the Perceptron algorithm and Winnow [GW98].
2 An Overview of On-line Learning of Disjunctions

In the Mistake Bound model introduced by Littlestone [Lit88, Lit89], the goal
of the learner is to make a number of mistakes not much greater than the best
classifier in some comparison class. In this paper we use monotone disjunctions
over $n$ variables as the comparison class. Such disjunctions are boolean formulas
of the form $x_{i_1} \vee x_{i_2} \vee \ldots \vee x_{i_k}$ where the indices $i_j$ belong to $\{1, \ldots, n\}$ and the
size $k$ is at most $n$. It is natural to represent a monotone disjunction $d$ by the
$n$-dimensional binary vector indicating which variables are in the disjunction.
For example, when $n = 5$ we will specify the disjunction $x_1 \vee x_3$ by the binary
vector $(1, 0, 1, 0, 0)$. Given a monotone disjunction $d \in \{0, 1\}^n$ and an instance
$x \in \{0, 1\}^n$, the prediction of $d$ on $x$ is defined to be the boolean value $d(x) = 1$
if $d \cdot x \ge 1$ and 0 otherwise.
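
The vector encoding and the prediction rule $d(x) = 1$ if $d \cdot x \ge 1$ can be written out directly. The function name `predict_disjunction` is only an illustrative choice.

```python
def predict_disjunction(d, x):
    """Return 1 if the monotone disjunction d fires on instance x (d . x >= 1), else 0."""
    return 1 if sum(di * xi for di, xi in zip(d, x)) >= 1 else 0

d = (1, 0, 1, 0, 0)                                  # encodes x1 OR x3 for n = 5
fires = predict_disjunction(d, (0, 0, 1, 0, 0))      # x3 is set
silent = predict_disjunction(d, (0, 1, 0, 1, 1))     # neither x1 nor x3 is set
```

The dot product counts how many variables of the disjunction are set in the instance, so the disjunction is true as soon as that count reaches one.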

Good learning algorithms in the mistake bound model make a number of
mistakes not much larger than twice$^3$ the number of mistakes made by the best
disjunction on an arbitrary sequence of examples. This can be easily achieved
for direct algorithms, such as direct WM. No known indirect algorithms achieve
this goal. In fact, for indirect algorithms it is only possible to prove relative
mistake bounds that are not much larger than twice$^4$ the number of attribute
errors of the best disjunction. A disjunction’s attribute errors are those bits in
the instances that must be changed so that the disjunction correctly labels the
modified instances. For disjunctions of size $k$, the number of attribute errors can
be up to a factor of $k$ larger than the number of classification errors. These
additional mistakes appear to be a necessary consequence of the indirect algorithm’s
improved computational efficiency. This penalty occurs only in the presence of
noise; in the noise-free case both direct and indirect algorithms have similar
$O(k \log n)$ mistake bounds (see [...]).

3 The Direct Bayes Algorithm for Disjunctions 

It is straightforward to apply Bayes methods (see, e.g. [...]) to the on-line
learning of disjunctions in the presence of noise. For instance, we might
assume that the unknown sequence is generated as follows. First, a “target”
disjunction $d$ is chosen at random from some prior distribution $P(\cdot \mid \lambda)$ on the
space of all monotone disjunctions over $n$ variables, where $\lambda$ denotes the
empty sequence. Second, each instance-label pair $(x_t, y_t)$ of the sequence $S^\ell =
(x_1, y_1), \ldots, (x_\ell, y_\ell)$ is drawn at random according to some probability
distribution $P(\cdot \mid d)$ such that $P((x_t, y_t) \mid S^{t-1}, d) = P(y_t \mid x_t, d) P(x_t \mid d)$, where
$P(x_t \mid d) = P(x_t)$ and $P(y_t \mid x_t, d) = \nu^{|y_t - d(x_t)|} (1 - \nu)^{(1 - |y_t - d(x_t)|)}$. In

® The factor of two disappears when a probabilistic prediction is allowed, so that the 
direct algorithm’s expected (w.r.t. its internal randomization) number of mistakes 
should not be much larger than the number of mistakes made by the best disjunc- 
tion UMl- 

Again, the factor of two multiplying the number of attribute errors disappears when 
a probabilistic prediction is allowed mm- 
other words, each label $y_1, \ldots, y_\ell$ of the sequence $S^\ell$ of examples differs from
that predicted by the selected disjunction with a probability that depends on an
arbitrary but fixed “noise rate” $\nu \in (0, 1/2)$.

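In this probabilistic setting, it is not too difficult to see that the Bayes prediction
rule simply outputs the label $\hat{y}_t$ such that

\[ \hat{y}_t = \arg\max_{y \in \{0,1\}} \sum_{d \in \{0,1\}^n} P(y \mid x_t, d)\, P(d \mid S^{t-1}). \tag{1} \]
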

At the end of every trial the current posterior distribution over the class of 
monotone disjunctions is then updated according to Bayes rule. 

Different choices of the noise rate $\nu$ produce different versions of the Bayes
optimal predictor (1). For instance, if $\beta < 1$ and $\nu = \beta/(\beta + 1)$, then the Bayes
prediction algorithm is identical (up to a trivial rescaling of the weights) to the
direct WM algorithm that always updates with factor $\beta$.
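
A minimal sketch of such a direct Bayes learner follows, keeping one posterior weight per disjunction. Several choices here are assumptions made for illustration rather than details from the paper: the uniform prior, the restriction to disjunctions of between 1 and $k$ variables, and the names `DirectBayes`, `all_disjunctions` and `evaluate`.

```python
from itertools import combinations, product

def all_disjunctions(n, k):
    """All monotone disjunctions over n variables with 1..k literals, as 0/1 tuples."""
    out = []
    for size in range(1, k + 1):
        for idx in combinations(range(n), size):
            out.append(tuple(1 if i in idx else 0 for i in range(n)))
    return out

def evaluate(d, x):
    """Value of the disjunction d on the instance x."""
    return 1 if any(di and xi for di, xi in zip(d, x)) else 0

class DirectBayes:
    def __init__(self, n, k, nu):
        self.nu = nu
        self.hyps = all_disjunctions(n, k)
        self.post = [1.0 / len(self.hyps)] * len(self.hyps)  # uniform prior (assumption)

    def predict(self, x):
        # P(y = 1 | x, S) = sum_d P(y = 1 | x, d) P(d | S)
        p1 = sum(w * ((1 - self.nu) if evaluate(d, x) == 1 else self.nu)
                 for d, w in zip(self.hyps, self.post))
        return 1 if p1 > 0.5 else 0

    def update(self, x, y):
        # Bayes rule with likelihood nu^{|y - d(x)|} (1 - nu)^{1 - |y - d(x)|}
        new = [w * (self.nu if evaluate(d, x) != y else 1 - self.nu)
               for d, w in zip(self.hyps, self.post)]
        z = sum(new)
        self.post = [w / z for w in new]

n, k, nu = 4, 2, 0.1
target = (1, 0, 1, 0)
learner = DirectBayes(n, k, nu)
for x in product((0, 1), repeat=n):          # one noise-free pass over all instances
    learner.update(x, evaluate(target, x))
best = learner.hyps[max(range(len(learner.hyps)), key=lambda j: learner.post[j])]
```

Each disagreeing disjunction is penalized by the ratio $\nu/(1-\nu)$ relative to an agreeing one, which equals $\beta$ when $\nu = \beta/(\beta+1)$. The cost is the number of weights, which grows as $O(n^k)$.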

4 A Technique for Deriving Indirect Algorithms 

In this section we present a general technique for deriving indirect prediction
algorithms for learning disjunctions. In particular we show that when some
independence assumptions are made regarding the generation of the instances and
labels, then the posterior distribution over disjunctions kept by the direct Bayes
algorithm can be encoded with only $O(n)$ weights. By appropriately fixing the
unknown parameters of the model we obtain simple update rules for the $O(n)$
weights encoding the posterior. To simplify the presentation, we specialize our
technique for the case where the update function is like the one used by
Bylander’s BEG algorithm [Byl97]. We also show that when this update function
is combined with the Bayes prediction rule, then the resulting mistake-driven
indirect algorithms do provably well in the adversarial noise-free setting when
learning disjunctions.

It is not easy to encode the Bayes posterior over disjunctions with only $n$ weights.
Our approach uses an expanded label space where each variable has its own label
bit. This vector-label prediction problem enables us to sidestep the normalization
constant that would otherwise appear when the successive posterior distributions
are computed, allowing an easy factorization of the posteriors. Combining this
expansion with a natural loss function yields Bayes algorithms that predict the
bit 1 whenever the posterior probability of the all-1 label vector $1^n$ is greater
than the posterior probability of the all-0 vector $0^n$. Thus these Bayes algorithms
for the vector-label problem can be used to solve the original disjunction problem
by simply converting the binary labels into the $1^n$ or $0^n$ vector-labels.

So far we have been unable to obtain interesting algorithms without going
through this vector-label problem. Neither considering the label as a stochastic
function of the attributes, nor considering the attributes as corrupted versions
of the label seemed to work. In the first case we were unable to decompose the
problem because of a normalizing factor depending on all of the components.
Although the posteriors factored in the second case, the resulting algorithms were
not Winnow-like, and we were unable to prove relative loss bounds for them.

4.1 The Bayesian Framework 

In this section we consider a vector-label prediction problem where the sequence
of examples $S^\ell = (x_1, y_1), \ldots, (x_\ell, y_\ell)$ consists of instances $x_t = (x_{t,1}, \ldots, x_{t,n}) \in
\{0, 1\}^n$ and vector-labels $y_t = (y_{t,1}, \ldots, y_{t,n}) \in \{0, 1\}^n$. We will use a natural loss
function between Booleans and vector-labels so that the predictions made by
the algorithm are the Boolean predictions required for the disjunction problem.
As in Section 3, we assume that the unknown sequence $S^\ell = (x_1, y_1), \ldots, (x_\ell, y_\ell)$
is generated by first selecting a “target” disjunction $d$ according to some prior
distribution $P(\cdot \mid \lambda)$ over the class of all monotone disjunctions and then by
drawing each instance-label pair $(x_t, y_t)$ of the sequence at random according
to some probability distribution $P(\cdot \mid d)$ on $\{0, 1\}^n \times \{0, 1\}^n$. However, here
we assume that the probability distributions of the model satisfy the following
assumptions.

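Model $\mathcal{M}$

AS1  $P(d \mid \lambda) = \prod_{i=1}^{n} P(d_i \mid \lambda)$

AS2  $P((x_t, y_t) \mid S^{t-1}, d) = P(y_t \mid x_t, d)\, P(x_t \mid d)$

AS3  $P(x_t \mid d) = P(x_t)$

AS4  $P(y_t \mid x_t, d) = \prod_{i=1}^{n} P(y_{t,i} \mid x_t, d) = \prod_{i=1}^{n} P(y_{t,i} \mid x_{t,i}, d_i)$

The assumptions of "model $\mathcal{M}$" are designed so that the posterior
probabilities over disjunctions have the following product form.

Lemma 1. Under model $\mathcal{M}$, for any sequence $S^t$ we have that

$$P(d \mid S^t) = \prod_{i=1}^{n} P(d_i \mid S_i^t) \qquad (2)$$

where $S_i^t = (x_{1,i}, y_{1,i}), \ldots, (x_{t,i}, y_{t,i})$.

Proof. The proof is by induction on $t$. If $t = 0$ then $S^0 = \lambda$ and the
thesis holds by AS1. Assume that $P(d \mid S^{t-1}) = \prod_{i=1}^{n} P(d_i \mid S_i^{t-1})$.
We now show that the decomposition also holds for $S^t$. Using Bayes Rule and
assumptions AS1 through AS4 it is not difficult to see that

$$P(d \mid S^{t-1}, (x_t, y_t))
  = \frac{P(d \mid S^{t-1})\, P((x_t, y_t) \mid S^{t-1}, d)}
         {\sum_{d' \in \{0,1\}^n} P((x_t, y_t) \mid S^{t-1}, d')\, P(d' \mid S^{t-1})}$$

$$= \frac{P(d \mid S^{t-1})\, P(y_t \mid x_t, d)\, P(x_t \mid d)}
         {\sum_{d' \in \{0,1\}^n} P(y_t \mid x_t, d')\, P(x_t \mid d')\, P(d' \mid S^{t-1})}$$

$$= \frac{\prod_{i=1}^{n} P(d_i \mid S_i^{t-1})\, P(y_{t,i} \mid x_{t,i}, d_i)}
         {\sum_{d' \in \{0,1\}^n} \prod_{i=1}^{n} P(d'_i \mid S_i^{t-1})\, P(y_{t,i} \mid x_{t,i}, d'_i)} \qquad (3)$$

where the first equality follows from Bayes Rule, the second equality from
assumption AS2 and the third equality follows from assumptions AS3, AS4
and the inductive hypothesis. Now, since in the denominator of (3) we are
summing over all $d' \in \{0,1\}^n$, sum and product can be switched and thus (3)
can equivalently be written as

$$P(d \mid S^{t-1}, (x_t, y_t))
  = \prod_{i=1}^{n} \left( \frac{P(d_i \mid S_i^{t-1})\, P(y_{t,i} \mid x_{t,i}, d_i)\, P(x_{t,i} \mid d_i)}
         {\sum_{d'_i \in \{0,1\}} P(y_{t,i} \mid x_{t,i}, d'_i)\, P(x_{t,i} \mid d'_i)\, P(d'_i \mid S_i^{t-1})} \right)$$

$$= \prod_{i=1}^{n} \frac{P(S_i^{t-1} \mid d_i)\, P((x_{t,i}, y_{t,i}) \mid d_i)\, P(d_i \mid \lambda)}
         {\sum_{d'_i \in \{0,1\}} P(S_i^{t-1} \mid d'_i)\, P((x_{t,i}, y_{t,i}) \mid d'_i)\, P(d'_i \mid \lambda)},$$

where in the first equality we have used assumption AS3 and in the second
equality we have applied Bayes Rule to $P(d_i \mid S_i^{t-1})$. Finally, observing that
$P((x_{t,i}, y_{t,i}) \mid d_i) = P((x_{t,i}, y_{t,i}) \mid S_i^{t-1}, d_i)$ and that
$P(S_i^{t-1} \mid d_i)\, P((x_{t,i}, y_{t,i}) \mid S_i^{t-1}, d_i) = P(S_i^{t-1}, (x_{t,i}, y_{t,i}) \mid d_i)$
we obtain by simple manipulations

$$P(d \mid S^{t-1}, (x_t, y_t))
  = \prod_{i=1}^{n} \frac{P(S_i^{t-1}, (x_{t,i}, y_{t,i}) \mid d_i)\, P(d_i \mid \lambda)}
         {\sum_{d'_i \in \{0,1\}} P(S_i^{t-1}, (x_{t,i}, y_{t,i}) \mid d'_i)\, P(d'_i \mid \lambda)}$$

$$= \prod_{i=1}^{n} \frac{P(S_i^{t-1}, (x_{t,i}, y_{t,i}), d_i)}{P(S_i^{t-1}, (x_{t,i}, y_{t,i}))}
  = \prod_{i=1}^{n} P(d_i \mid S_i^{t-1}, (x_{t,i}, y_{t,i})).$$

This concludes the proof. □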
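
The product form of Lemma 1 can be checked numerically on a small instance. The
sketch below is only an illustration: the likelihood table `lik`, the prior
values and the random data are hypothetical choices, and any table of
per-component label probabilities would do. It compares the posterior obtained
by enumerating all $2^n$ disjunctions with the product of the $n$
single-component posteriors.

```python
import itertools
import random

def random_likelihood(rng):
    """Hypothetical table lik[(y, x, d)] = P(y_i = y | x_i = x, d_i = d)."""
    lik = {}
    for x in (0, 1):
        for d in (0, 1):
            p = rng.uniform(0.1, 0.9)
            lik[(1, x, d)] = p
            lik[(0, x, d)] = 1.0 - p
    return lik

def component_posterior(prior_i, lik, xs, ys):
    """P(d_i = 1 | S_i), computed from the i-th column of the data only."""
    p1, p0 = prior_i, 1.0 - prior_i
    for x, y in zip(xs, ys):
        p1 *= lik[(y, x, 1)]
        p0 *= lik[(y, x, 0)]
    return p1 / (p1 + p0)

def joint_posterior(priors, lik, X, Y):
    """P(d | S) for every d in {0,1}^n, by brute-force enumeration."""
    n = len(priors)
    scores = {}
    for d in itertools.product((0, 1), repeat=n):
        s = 1.0
        for i in range(n):
            s *= priors[i] if d[i] else 1.0 - priors[i]  # AS1: product prior
            for x, y in zip(X, Y):
                s *= lik[(y[i], x[i], d[i])]             # AS4: per-component labels
        scores[d] = s                                     # P(x_t) cancels by AS3
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

def max_decomposition_gap(n=3, trials=6, seed=0):
    """Largest difference between the joint posterior and the product form."""
    rng = random.Random(seed)
    lik = random_likelihood(rng)
    priors = [rng.uniform(0.1, 0.9) for _ in range(n)]
    X = [[rng.randint(0, 1) for _ in range(n)] for _ in range(trials)]
    Y = [[rng.randint(0, 1) for _ in range(n)] for _ in range(trials)]
    joint = joint_posterior(priors, lik, X, Y)
    w = [component_posterior(priors[i], lik,
                             [x[i] for x in X], [y[i] for y in Y])
         for i in range(n)]
    gap = 0.0
    for d, p in joint.items():
        prod = 1.0
        for i in range(n):
            prod *= w[i] if d[i] else 1.0 - w[i]
        gap = max(gap, abs(p - prod))
    return gap
```

Calling `max_decomposition_gap()` should return a value at the level of
floating-point rounding, while the brute-force side costs time exponential in
$n$ and the product side only linear.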

Thus maintaining the posterior $P(d \mid S^t)$ reduces to maintaining the $n$
independent posteriors $P(d_i \mid S_i^t)$, each of which can be encoded with a
single weight. Before proceeding further with the analysis of our Bayesian
framework, we point out an important difference between our set of assumptions
and the ones used by the popular Naive Bayes algorithm.

Both methodologies make some simplifying assumptions regarding the generation
of the instances and labels that allow them to use only $O(n)$ time per trial
when learning disjunctions. Naive Bayes simply assumes that the attribute
values are conditionally independent given the label, i.e. for any instance $x_t$
and label $y_t$,

$$P(x_t \mid y_t) = \prod_{i=1}^{n} P(x_{t,i} \mid y_t). \qquad (4)$$

However, Naive Bayes makes no use of the fact that the examples are generated
by some target disjunction. Our model allows the algorithm to track the
posterior probabilities of the various disjunctions. Despite the simple
assumption (4), Domingos and Pazzani [DP96] show that if the instances are drawn
uniformly at random, then Naive Bayes is optimal for learning disjunctions in
the average case setting. However, experimental results reported by Littlestone
[Lit95] show that Naive Bayes is not optimal in the relative mistake bound
setting even when it is run in a conservative way.
An Update Rule for the Posterior Probabilities. We now consider a particular
family of distributions, $\{P_{\beta_0,\beta_1,\gamma}(y \mid x, d)\}_{x,y,d \in \{0,1\}}$,
for which the weights encoding the Bayes posteriors are easily updated. This
family has the noise parameters $0 < \gamma < 1$, $0 \le \beta_0 < 1$, and
$\beta_1 > 1$, and is defined as follows:

$$P_{\beta_0,\beta_1,\gamma}(y \mid x, d) =
\begin{cases}
\left( \dfrac{1-\beta_0}{\beta_1-\beta_0}\, \beta_1^{\,d} \right)^{y}
\left( \dfrac{\beta_1-1}{\beta_1-\beta_0}\, \beta_0^{\,d} \right)^{1-y} & \text{if } x = 1 \\[2ex]
(1-\gamma)^{y}\, \gamma^{1-y} & \text{otherwise}
\end{cases}$$

Parameter $\gamma$ fixes the label probabilities when $x = 0$, where the label
does not depend on $d$, and the $\beta$ parameters jointly encode different noise
probabilities for the case when $x = 1$, $d = 1$, and the case when $x = 1$, $d = 0$.

After seeing a new example, the weights encoding the posterior are updated as
in Bylander's BEG algorithm.

Theorem 1. Let $S$ be the sequence of examples through trial $t$ and let $(x, y)$
be the example received at trial $t + 1$. If for each $i = 1, \ldots, n$, the
probability $P(y_i \mid x_i, d_i)$ is equal to $P_{\beta_0,\beta_1,\gamma}(y_i \mid x_i, d_i)$
and $P(d_i \mid S_i) = w_i^{d_i} (1 - w_i)^{1-d_i}$, then in model $\mathcal{M}$

$$P(d \mid S) = \prod_{i=1}^{n} w_i^{d_i} (1 - w_i)^{1-d_i}, \qquad
  P(d \mid S, (x, y)) = \prod_{i=1}^{n} \tilde{w}_i^{d_i} (1 - \tilde{w}_i)^{1-d_i} \qquad (5)$$

where

$$\tilde{w}_i = \frac{w_i\, \beta_{y_i}^{x_i}}{1 - w_i + w_i\, \beta_{y_i}^{x_i}}. \qquad (6)$$

Proof Sketch. By using assumption AS3, a case analysis shows that for the
distribution $P(y_i \mid x_i, d_i)$ assumed in the theorem, the Bayes rule for
computing successive posterior probabilities for the components reduces to
equation (6). In fact, a more tedious analysis can be used to show that the
distribution $P_{\beta_0,\beta_1,\gamma}(y_i \mid x_i, d_i)$ is the only distribution
(under our four assumptions) for which identity (6) holds. The theorem then
follows by combining this Bayesian single component update rule with the
product decomposition (2). □

Notice that update rule (6) is independent of the parameter $\gamma$ that specifies
the distribution $P_{\beta_0,\beta_1,\gamma}$. This is because when $x = 0$, the
disjunction cannot evaluate to 1, and the probability that $y = 1$ is the
(unknown) noise rate. This value can be set to any value $0 < \gamma < 1$ without
affecting the update rule.
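
The single-component update can be compared against a direct application of
Bayes rule. The sketch below is a check under stated assumptions, not the
authors' code: the function `family_prob` encodes one form of the family that
is consistent with the marginals used in the proof sketch of Theorem 2 (with
$\gamma$ the probability of label 0 when $x = 0$), and the parameter values in
the check are arbitrary.

```python
def family_prob(y, x, d, beta0, beta1, gamma):
    """Assumed form of P(y | x, d) for the (beta0, beta1, gamma) family."""
    if x == 1:
        norm = beta1 - beta0
        if y == 1:
            return (1.0 - beta0) / norm * (beta1 if d else 1.0)
        return (beta1 - 1.0) / norm * (beta0 if d else 1.0)
    # When x = 0 the label does not depend on d.
    return (1.0 - gamma) if y == 1 else gamma

def bayes_posterior(w, x, y, beta0, beta1, gamma):
    """P(d_i = 1 | S_i, (x_i, y_i)) by Bayes rule, given prior weight w."""
    num = w * family_prob(y, x, 1, beta0, beta1, gamma)
    den = num + (1.0 - w) * family_prob(y, x, 0, beta0, beta1, gamma)
    return num / den

def beg_update(w, x, y, beta0, beta1):
    """Multiplicative update: w * beta_y^x / (1 - w + w * beta_y^x)."""
    factor = (beta1 if y == 1 else beta0) ** x
    return w * factor / (1.0 - w + w * factor)

def max_update_gap(beta0=0.5, beta1=2.0, gammas=(0.2, 0.7)):
    """Largest difference between the two computations over a small grid."""
    gap = 0.0
    for gamma in gammas:
        for w in (0.05, 0.3, 0.5, 0.9):
            for x in (0, 1):
                for y in (0, 1):
                    a = bayes_posterior(w, x, y, beta0, beta1, gamma)
                    b = beg_update(w, x, y, beta0, beta1)
                    gap = max(gap, abs(a - b))
    return gap
```

Because `beg_update` never reads `gamma`, agreement for several values of
`gamma` also illustrates why the update is independent of that parameter.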



A Bayes Predictor for Bit-labels. The next step is to map the posterior
distribution (5) and the current instance into a prediction. Since the
disjunction problem requires single bit predictions (rather than
vector-labels), we define a natural loss function between vector-labels and
bit-labels that is 1 if and only if some component of the vector-label differs
from the bit-label, i.e. for $y_t \in \{0,1\}^n$ and $\hat{y} \in \{0,1\}$, the
function $L^n(y_t, \hat{y}) = 1$ if $\exists j$ such that $y_{t,j} \neq \hat{y}$, and
$L^n(y_t, \hat{y}) = 0$ otherwise.

The Bayes optimal algorithm for this loss and probabilistic model $\mathcal{M}$
predicts the binary label $\hat{y}$ that minimizes the expected loss, i.e.

$$\hat{y}_t = \arg\min_{\hat{y} \in \{0,1\}} \sum_{y_t \in \{0,1\}^n} L^n(y_t, \hat{y})\, P(y_t \mid x_t, S). \qquad (7)$$

It is not difficult to see that this simplifies to

$$\hat{y}_t = \arg\max_{\hat{y} \in \{0,1\}} P(\hat{y}^n \mid x_t, S) \qquad (8)$$

where $\hat{y}^n$ is the $n$-dimensional vector with each component set to
$\hat{y}$. Thus the Bayes optimal prediction is the bit $\hat{y}$ for which the
corresponding vector $\hat{y}^n$ is more likely.

Prediction (8) can be expressed in a dot product form over a transformed weight
space as shown by the following result.

Theorem 2. Let $S = (x_1, y_1), \ldots, (x_{t-1}, y_{t-1})$ be the sequence of
examples observed before trial $t$ and let $x_t$ be the instance received at the
beginning of trial $t$. If $P(y \mid x_{t,i}, d_i)$ is some
$P_{\beta_0,\beta_1,\gamma}(y \mid x_{t,i}, d_i)$, then under model $\mathcal{M}$
decision rule (8) can be expressed in the following form:

$$\hat{y}_t = \begin{cases} 1 & \text{if } x_t \cdot z_t > \theta \\ 0 & \text{otherwise} \end{cases}$$

where $\theta = n \ln(\gamma/(1-\gamma))$ and

$$z_{t,i} = \ln \left[ \frac{\gamma (1-\beta_0)}{(1-\gamma)(\beta_1 - 1)} \cdot
  \frac{1 + w_{t,i}(\beta_1 - 1)}{1 + w_{t,i}(\beta_0 - 1)} \right].$$
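Proof Sketch. Under model $\mathcal{M}$ it is not difficult to see that

$$P(y_t \mid x_t, S) = \prod_{i=1}^{n} \Big[ \sum_{d_i \in \{0,1\}} P(y_{t,i} \mid x_{t,i}, d_i)\, P(d_i \mid S_i) \Big]. \qquad (9)$$

Substituting the expression for $P(y_t \mid x_t, S)$ given in (9) in the Bayes
decision rule (8) we obtain

$$\hat{y}_t = \arg\max_{\hat{y} \in \{0,1\}} \prod_{i=1}^{n} \sum_{d_i \in \{0,1\}} P(\hat{y} \mid x_{t,i}, d_i)\, P(d_i \mid S_i). \qquad (10)$$

The thesis then follows, by simple manipulations, from (10) and the facts

$$\sum_{d_i \in \{0,1\}} P(y = 1 \mid x_{t,i}, d_i)\, P(d_i \mid S_i)
  = \left[ \frac{1-\beta_0}{\beta_1-\beta_0}\, \big(1 + w_{t,i}(\beta_1 - 1)\big) \right]^{x_{t,i}} (1-\gamma)^{1-x_{t,i}}$$

$$\sum_{d_i \in \{0,1\}} P(y = 0 \mid x_{t,i}, d_i)\, P(d_i \mid S_i)
  = \left[ \frac{\beta_1-1}{\beta_1-\beta_0}\, \big(1 + w_{t,i}(\beta_0 - 1)\big) \right]^{x_{t,i}} \gamma^{1-x_{t,i}}.$$

□

Bayes-BEG

Input: $0 \le \beta_0 < 1$, $\beta_1 > 1$, $0 < \gamma < 1$, and $n \ge 1$
Initialization: Let $w_1 = (w_{1,1}, \ldots, w_{1,n})$ be a weight vector in $[0,1]^n$

For $t = 1, 2, \ldots$

Prediction Rule: Upon receiving the instance $x_t$, if $x_t \cdot z_t > \theta$ then
predict 1, otherwise predict 0, where

$$z_{t,i} = \ln \left[ \frac{\gamma (1-\beta_0)}{(1-\gamma)(\beta_1 - 1)} \cdot
  \frac{1 + w_{t,i}(\beta_1 - 1)}{1 + w_{t,i}(\beta_0 - 1)} \right], \qquad
  \theta = n \ln(\gamma/(1-\gamma)).$$

Update Function: Observe vector-label $y_t$ and for each $i = 1, \ldots, n$ set

$$w_{t+1,i} = \frac{w_{t,i}\, \beta_{y_{t,i}}^{x_{t,i}}}{1 - w_{t,i} + w_{t,i}\, \beta_{y_{t,i}}^{x_{t,i}}} \qquad (11)$$

Update Policy: Update in all trials.

Fig. 1. The Bayes-BEG algorithm.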

We call the indirect algorithm using the prediction rule of Theorem 2 and the
update function described in Theorem 1 the Bayes-BEG algorithm. The algorithm
is summarized in Figure 1. Its always-update version minimizes the probability
of a mistake with respect to the discrete loss $L^n$ when the vector-labels
are generated by $P_{\beta_0,\beta_1,\gamma}(y \mid x, d)$ as per model $\mathcal{M}$.
However, when learning disjunctions it will only see the vector-labels $1^n$
and $0^n$.
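
The dot-product form of the prediction can be compared with a direct comparison
of the probabilities of the two constant vector-labels. The sketch below makes
two assumptions that should be read as such: the per-attribute marginals take
the form used in the proof sketch of Theorem 2, with $\gamma$ the probability of
label 0 when $x_{t,i} = 0$, and the weights and parameters are arbitrary test
values.

```python
import itertools
import math
import random

def z_weight(w, beta0, beta1, gamma):
    """Transformed weight z_{t,i} used in the dot-product prediction."""
    return math.log(gamma * (1 - beta0) * (1 + w * (beta1 - 1))
                    / ((1 - gamma) * (beta1 - 1) * (1 + w * (beta0 - 1))))

def predict_dot(x, weights, beta0, beta1, gamma):
    """Predict 1 iff x . z > theta, with theta = n ln(gamma / (1 - gamma))."""
    theta = len(x) * math.log(gamma / (1 - gamma))
    score = sum(z_weight(w, beta0, beta1, gamma)
                for w, xi in zip(weights, x) if xi)
    return 1 if score > theta else 0

def predict_direct(x, weights, beta0, beta1, gamma):
    """Predict the bit whose constant vector-label is more likely."""
    p_ones = p_zeros = 1.0
    for w, xi in zip(weights, x):
        if xi:
            p_ones *= (1 - beta0) / (beta1 - beta0) * (1 + w * (beta1 - 1))
            p_zeros *= (beta1 - 1) / (beta1 - beta0) * (1 + w * (beta0 - 1))
        else:
            p_ones *= 1 - gamma
            p_zeros *= gamma
    return 1 if p_ones > p_zeros else 0

def count_disagreements(n=5, beta0=0.3, beta1=2.5, gamma=0.6, seed=1):
    """Number of instances in {0,1}^n on which the two rules differ."""
    rng = random.Random(seed)
    weights = [rng.uniform(0.05, 0.95) for _ in range(n)]
    return sum(
        predict_dot(x, weights, beta0, beta1, gamma)
        != predict_direct(x, weights, beta0, beta1, gamma)
        for x in itertools.product((0, 1), repeat=n)
    )
```

The direct rule multiplies $n$ probabilities, which can underflow for large
$n$; working with the logarithmic $z$-weights avoids this.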



4.2 The Mistake-driven Bayes-BEG Algorithm 



We now turn from the probabilistic setting to the adversarial setting where we
analyze MD-Bayes-BEG, the mistake-driven version of Bayes-BEG, assuming
the algorithm only sees the vector-labels $1^n$ and $0^n$ that correspond to the
labels for the disjunction problem. We use $M_{\mathrm{alg}}(S)$ to denote the number
of mistakes made by algorithm "alg" on sequence $S$.

We start by proving a mistake bound for the MD-Bayes-BEG algorithm when
$\beta_0 = 0$.

Theorem 3. Let $n \ge 2$, $c = ((e+1)/(e-1))^{1/n}$ and set $\gamma = c/(1+c)$,
$\beta_1 = 1 + c$ and $\beta_0 = 0$. Furthermore, let $w_{1,i} = 1/n$ for
$i = 1, \ldots, n$. Then for all sequences $S = (x_1, y_1), \ldots, (x_\ell, y_\ell)$
such that there exists a monotone disjunction consistent with $S$ we have



^^MD-Bayes-BEoi^) ^ ^.48 -b 2.48fc 



1 + 



Ina 



{ 2(n-l) \ \ 

'v(l + c)(e-l); 



( 12 ) 



where k is the number of relevant variables in the target disjunction. 
Proof. The bound is proven by following the same approach used in Section 5 of
Littlestone [Lit88]. As in [Lit88], we call every trial where the algorithm
predicts $\hat{y}_t = 0$ and $y_t = 1$ a promotion step and every trial where
$\hat{y}_t = 1$ but $y_t = 0$ an elimination step. Since the MD-Bayes-BEG
prediction algorithm is mistake-driven, its number of mistakes is equal to the
number of promotion steps, $p$, plus the number of elimination steps, $d$, made
by the algorithm. The theorem is then proved by bounding the number of
promotion and elimination steps.

To avoid confusion, in the rest of the proof we will call the weights $z$ used in
the dot product prediction the "z-weights", and the weights $w$ associated with
the attributes the "w-weights".

Let $Z_t = \sum_{i=1}^{n} z_{t,i}$ be the total z-weight at the beginning of trial
$t$. Since for any $i \in \{1, \ldots, n\}$ and any trial $t$ we have $z_{t,i} \ge 0$,
and $z_{t,i}$ only increases/decreases during promotion/elimination steps, it
follows that $Z_1 + p\, Z_{gain} - d\, Z_{lost} \ge 0$, where $Z_{gain}$ is an upper
bound on the total z-weight gained during a promotion step and $Z_{lost}$ is a
lower bound on the total z-weight lost during an elimination step. By solving
it with respect to $d$ we obtain that $d \le (Z_1/Z_{lost}) + p\,(Z_{gain}/Z_{lost})$.
Hence, the number of mistakes made by MD-Bayes-BEG can be upper bounded by

$$M_{\text{MD-Bayes-BEG}}(S) \le \frac{Z_1}{Z_{lost}} + p \left( 1 + \frac{Z_{gain}}{Z_{lost}} \right). \qquad (13)$$

We now estimate the quantities in the right hand side of (13). For the total
initial z-weight, $Z_1 = \sum_{i=1}^{n} z_{1,i}$, it is easy to see that when
$c = ((e+1)/(e-1))^{1/n}$ and $w_{1,i} = 1/n$ ($i = 1, \ldots, n$), we have
$Z_1 = n \ln\left(1 + \frac{\beta_1}{n-1}\right) \le 2\beta_1$, where the inequality
follows from the fact that for $r \ge 0$ we have $\ln(1 + r) \le r$ and
$n/(n-1) \le 2$ for $n \ge 2$. Similarly, $Z_{lost} \ge \theta$ since during each
elimination step we have $\sum_{i=1}^{n} x_{t,i} z_{t,i} > \theta$ and $z_{t+1,i} = 0$
for any attribute such that $x_{t,i} = 1$.
For $Z_{gain}$ the analysis is more involved. First note that if $x_{t,i} = 1$ the
corresponding weight $w_{t,i}$ is updated to
$w_{t+1,i} = w_{t,i}\beta_1/(1 - w_{t,i} + w_{t,i}\beta_1)$. Substituting $w_{t+1,i}$
into $z_{t+1,i}$ and observing that for $w_{t,i} \in [0,1]$ the ratio
$z_{t+1,i}/z_{t,i}$ is decreasing with respect to $w_{t,i}$ and
$\lim_{w_{t,i} \to 0} z_{t+1,i}/z_{t,i} = (1 + c)$, we obtain
$z_{t+1,i}/z_{t,i} \le (1 + c)$. Now,
$Z_{gain} = \sum_{i=1}^{n} (z_{t+1,i} - z_{t,i}) \le c \sum_{i=1}^{n} x_{t,i} z_{t,i} \le c\,\theta$,
where the last inequality follows from the fact that during a promotion step we
have $\sum_{i=1}^{n} x_{t,i} z_{t,i} \le \theta$.

We now bound the number of promotion steps incurred by the algorithm. We
first observe that if the w-weight assigned to each relevant attribute is
$\ge 1/g(c)$ where $g(c) = 1 + ((1 + c)(e - 1)/2)$, then
$z_{t,i} \ge n \ln(\gamma/(1-\gamma))$ and positive examples are correctly
classified by MD-Bayes-BEG. By simple manipulation it is not difficult to see
that if $x_{t,i} = 1$ and $w_{t,i} = 1 - 1/(1 + (1+c)^{m})$ then at the end of a
promotion step the updated weight is $w_{t+1,i} = 1 - 1/(1 + (1+c)^{m+1})$.
By expressing the initial and final weights of a relevant attribute in this
form, we have that the number of promotion steps per relevant attribute can be
upper bounded by $\lceil \log_{1+c}((2(n-1))/((1+c)(e-1))) \rceil$ and thus,

p < k 



In, 



c+l 



2(n- 1) 
(l-kc)(e- 1) 



<k [1 + 



Ino 



2(n- 1) 
(l-kc)(e- 1) 



(14) 
Bound (12) then immediately follows by plugging the above estimates for $Z_1$,
$Z_{lost}$, $Z_{gain}$ and $p$ into (13) and by observing that
$1 < c \le \sqrt{(1+e)/(e-1)}$. □

The Bayes-BEG update function with $\beta_0 = 0$ sets $w_{t+1,i}$ to 0 whenever
$x_{t,i} = 1$ and $y_t = 0$, and the multiplicative nature of the update ensures
that $w_i$ remains 0 thereafter. Therefore, if an example has the label 0 but
all the variables are 1, then all of the weights get set to zero and the
algorithm is no longer able to predict 1. This indicates that the $\beta_0 = 0$
version of Bayes-BEG is unable to tolerate noise.

On the other hand, if $\beta_0 > 0$ then the weights will always be positive. Even
if the weight of a variable in the best disjunction gets driven down by noisy
examples, the multiplicative update will allow it to recover before the
algorithm makes too many additional mistakes. Although the exact analysis with
noise is difficult, the next result shows that noise tolerant versions of the
algorithm (with $\beta_0 > 0$) also have similar noise-free mistake bounds.

Theorem 4. Let n>2,q= 16/10, e = 9/10 and set 7/ (1 — 7) = /?o = 

1 — — 1 -I- e)/((l -I- and Pi = 1 + — 1 + e)/(l + q)- Then 

for all sequences $S = (x_1, y_1), \ldots, (x_\ell, y_\ell)$ such that there exists a
monotone disjunction consistent with $S$ we have

$$M_{\text{MD-Bayes-BEG}}(S) \le 24.79 + 8.44\, k \ln(n - 1) + 5.76\, k, \qquad (15)$$

where $k$ is the number of relevant variables in the target disjunction.

Proof Sketch. It is similar to the proof of Theorem 3 but now, rather than
analyzing the change in the total weight $Z_t = \sum_{i=1}^{n} z_{t,i}$, where the
$z_{t,i}$ are the weights used in the dot product prediction, we analyze the
change in the total weight $Y_t = \sum_{i=1}^{n} y_{t,i}$ where
$y_{t,i} = \ln((1 + w_{t,i}(\beta_1 - 1))/(1 - w_{t,i}(1 - \beta_0)))$. Details
of the proof are omitted. □



4.3 The Thresholded-BEG Algorithm 

An indirect algorithm related to Bayes-BEG results when the update function
of Figure 1 is combined with the simple thresholded dot product prediction
rule used by Winnow. That is, rather than predicting with the prediction rule
of Figure 1, the algorithm predicts 1 if $x_t \cdot w_t > \theta$, and 0 otherwise.
We call the resulting algorithm Thresholded-BEG.

This algorithm is much easier to analyze with the existing relative mistake bound
techniques [Lit89, AW98], and it is not difficult to get relative mistake bounds
for it even in the noisy case. For instance, if no information besides the
number $n$ of attributes is given, then the following bound on the number of
mistakes made by the algorithm on any sequence of examples where the best
disjunction incurs at most $A$ attribute errors can be proven. Recall that the
attribute errors of a disjunction $d$ are those bits in the Boolean instances
that have to be changed so that $d$ correctly labels the modified instances.
Theorem 5. Let $\alpha > 1$ and set $\beta_1 = \alpha$, $\beta_0 = 1/\alpha$ and
$\theta = (\alpha \ln \alpha)/(\alpha^2 - 1)$. Let $w_1 > 0$. Then for all sequences
$S$ such that there exists a monotone disjunction with at most $A$ attribute
errors on $S$, we have



$$M_{\text{Thresholded-BEG}}(S) \le (\alpha + 1)\, \frac{\mathrm{dist}_{bre}(u, w_1) + A \ln \alpha}{\ln \alpha} \qquad (16)$$

where $\mathrm{dist}_{bre}(u, w_1) = \sum_{i=1}^{n} u_i \ln(u_i/w_{1,i}) + (1 - u_i) \ln((1 - u_i)/(1 - w_{1,i}))$
is the binary relative entropy between the target disjunction $u$ and the
initial weight vector $w_1$ used by the algorithm.



Proof Sketch. The technique used to derive the mistake bound is similar to
Auer and Warmuth's [AW98]. The analysis proceeds by showing that the distance
between the weight vector $w_t$ used by the algorithm and a target weight
vector $u$ decreases whenever the algorithm makes a mistake. However, our
analysis uses the binary relative entropy as a potential function. Details of
the proof are omitted. □

It is interesting to observe that bound (16) of Theorem 5 has the same form
as the bound derived for Winnow in [AW98], except that in the latter the
binary relative entropy of (16) is replaced by the un-normalized relative
entropy
$\mathrm{dist}_{ure}(u, w_1) = \sum_{i=1}^{n} u_i \ln(u_i/w_{1,i}) + w_{1,i} - u_i$.

Better results can be obtained if the algorithm has some additional information
regarding the sequence to be predicted. For instance, if the number $A$ of
attribute errors of the best disjunction is known in advance, then the
parameters of the algorithm can be optimally tuned to obtain bounds similar to
those derived in [AW98] (although with slightly worse constants). For example,
in the noise-free case, i.e. when the algorithm knows ahead of time that there
exists a monotone disjunction consistent with $S$ ($A = 0$), we get a bound that
is incomparable to Theorem 3.

Corollary 1. Let $n \ge 2$ and set $\beta_0 = 0$, $\beta_1 = e$ and $\theta = 1/e$.
Furthermore, let $w_{1,i} = 1/n$ for $i = 1, \ldots, n$. Then for all sequences
$S = (x_1, y_1), \ldots, (x_\ell, y_\ell)$ such that there exists a monotone
disjunction consistent with $S$ we have

$$M_{\text{Thresholded-BEG}}(S) \le 3.76 + 2.72\, k \ln(n), \qquad (17)$$

where $k$ is the number of relevant variables in the target disjunction.
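
A minimal simulation of Thresholded-BEG in the noise-free setting of
Corollary 1 might look as follows. This is a sketch under stated assumptions:
it applies the update only on mistakes (a Winnow-style policy assumed here),
and the data generator `make_sequence`, its attribute density and the choice
of relevant variables are hypothetical.

```python
import math
import random

def beg_update(w, x, y, beta0, beta1):
    """Apply w_i <- w_i * b / (1 - w_i + w_i * b) on the active attributes."""
    b = beta1 if y == 1 else beta0
    return [wi * b / (1.0 - wi + wi * b) if xi else wi
            for wi, xi in zip(w, x)]

def run_thresholded_beg(examples, n, beta0=0.0, beta1=math.e,
                        theta=1.0 / math.e):
    """Return the number of mistakes; weights change only on mistakes."""
    w = [1.0 / n] * n
    mistakes = 0
    for x, y in examples:
        y_hat = 1 if sum(wi for wi, xi in zip(w, x) if xi) > theta else 0
        if y_hat != y:
            mistakes += 1
            w = beg_update(w, x, y, beta0, beta1)
    return mistakes

def make_sequence(n, relevant, length, seed):
    """Random sparse instances labeled by a monotone disjunction."""
    rng = random.Random(seed)
    examples = []
    for _ in range(length):
        x = [1 if rng.random() < 0.1 else 0 for _ in range(n)]
        y = 1 if any(x[i] for i in relevant) else 0
        examples.append((x, y))
    return examples
```

For instance, `run_thresholded_beg(make_sequence(50, [0, 1, 2], 2000, seed=3), 50)`
counts the mistakes on a sequence consistent with a disjunction of $k = 3$
variables, which can then be compared with the value of bound (17) for
$n = 50$.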

5 The Normalized Winnow Algorithm 

The Normalized Winnow algorithm of Littlestone is another mistake-driven linear
threshold algorithm for on-line learning of disjunctions with a good relative
mistake bound. This algorithm is identical to Winnow except that it normalizes
its weights before computing the linear threshold prediction. Techniques like
those employed in Subsection 4.1 show that Normalized Winnow is also closely
related to Bayesian methods.

Whereas the prior on disjunctions used in Subsection 4.1 was a product of $n$
Bernoulli distributions, the prior we found useful for Normalized Winnow is the
$n$-fold product of a distribution over $\{1, 2, \ldots, n\}$. Since sampling this
prior gives a vector in $\{1, 2, \ldots, n\}^n$, most of the $2^n$ disjunctions will
be represented by several possible outcomes. For example, the disjunction
$x_1 \vee x_3 \vee x_7$ is represented by all vectors containing only 1's, 3's,
and 7's, and at least one of each.

Under model $\mathcal{M}$ (with a slight modification to assumption AS4 for the new
prior) the posterior probabilities over the space $\{1, 2, \ldots, n\}^n$ can also
be represented as an $n$-fold product distribution. As before, the vector-label
problem is used to decouple the attributes and obtain this result. It turns
out that the posterior probabilities/weights of the integers in
$\{1, 2, \ldots, n\}$ are updated in the same way as the weights of the Normalized
Winnow algorithm. With the loss function defined in Subsection 4.1 the
Bayes-optimal predictions are the same thresholded dot products between the
instances and weights used by the Normalized Winnow algorithm. This
establishes a close correspondence between Normalized Winnow and Bayes methods.

Although the relationship between Normalized Winnow and its corresponding
Bayes algorithm is analogous to the relationship between indirect Bayes-BEG
and BEG, there are two subtle differences. Indirect Bayes-BEG uses a
logarithmic function of the weights in its dot-product prediction rule, while
Normalized Winnow and its corresponding Bayes algorithm both predict with the
simple thresholded dot-product between the weights (or probabilities) and the
instance. However, the predictions of the Bayes algorithm remain a simple
thresholded dot-product only as long as the vector-labels are either $1^n$ or
$0^n$. Vector-labels containing both 1s and 0s break the symmetry and the Bayes
optimal prediction is no longer a dot product.

The technical details for relating Normalized Winnow to Bayesian methods are
more complex than for the new classification variant of BEG, but the basic
approach is the same. It is now natural to ask if the original (un-normalized)
Winnow algorithm also has a corresponding Bayesian interpretation. Our attempts
in this direction have been unsuccessful. Using Poisson distributions to encode
the prior and posteriors over disjunctions appears promising since it
corresponds to the un-normalized relative entropy used to analyze Winnow
[Lit89, AW98]. However, Winnow's weights do not seem to encode the proper
Poisson posteriors.



6 Conclusions 

The Winnow family of algorithms is surprisingly good at learning disjunctions
in the relative mistake bound model. These algorithms are very efficient, using
only $O(n)$ weights. The goal of this research is to gain a better understanding
of this family by exploring its relationship to Bayesian methods. Although we
have not yet answered this question for Winnow itself, we do have a Bayesian
interpretation for the prediction and update rules used by Normalized Winnow
and a new classification variant of BEG.
We started by investigating the assumptions necessary to encode the posteriors
over monotone disjunctions kept by Bayes algorithms with only $O(n)$ weights.
Our methods lead to computationally efficient algorithms which are motivated
by a Bayesian analysis. For one of these algorithms, indirect Bayes-BEG, we have
examined how its mistake-driven variant performs in the relative mistake bound
model when learning disjunctions. In the noise-free case we have shown that this
variant has mistake bounds with the same form as the best known indirect
algorithms for learning disjunctions. Further results imply that this algorithm
can tolerate noise, but the complexity of its predictions makes the analysis
difficult.
Abstract. We consider algorithms for combining advice from a set of
experts. In each trial, the algorithm receives the predictions of the ex- 
perts and produces its own prediction. A loss function is applied to mea- 
sure the discrepancy between the predictions and actual observations. 
The algorithm keeps a weight for each expert. At each trial the weights 
are first used to help produce the prediction and then updated accord- 
ing to the observed outcome. Our starting point is Vovk’s Aggregating 
Algorithm, in which the weights have a simple form: the weight of an 
expert decreases exponentially as a function of the loss incurred by the 
expert. The prediction of the Aggregating Algorithm is typically a non- 
linear function of the weights and the experts’ predictions. We analyze 
here a simplified algorithm in which the weights are as in the original 
Aggregating Algorithm, but the prediction is simply the weighted aver- 
age of the experts’ predictions. We show that for a large class of loss 
functions, even with the simplified prediction rule the additional loss of 
the algorithm over the loss of the best expert is at most $c \ln n$, where
n is the number of experts and c a constant that depends on the loss 
function. Thus, the bound is of the same form as the known bounds for 
the Aggregating Algorithm, although the constants here are not quite as 
good. We use relative entropy to rewrite the bounds in a stronger form 
and to motivate the update. 



1 Introduction 

The focus of this paper is a certain class of on-line learning algorithms. In
on-line learning the algorithm receives one by one a sequence of inputs $x_t$
and makes after each $x_t$ a prediction $\hat{y}_t$. For each input $x_t$ there is
also a corresponding outcome (or desired output) $y_t$ which is revealed to the
learner after it has made its prediction $\hat{y}_t$.

To define our on-line learning problem more closely, we need to specify which
sequences $((x_1, y_1), \ldots, (x_\ell, y_\ell))$ are allowed as inputs, and what is
the criterion for judging the quality of the predictions $\hat{y}_t$. Regarding
the input sequences, we take a worst-case view that given some domain $X$ for
the inputs and $Y$ for the outcomes, for each $t$ the pair $(x_t, y_t)$ can be any
element of $X \times Y$. In particular, the pairs need not come from any
probability distribution, and we make no assumptions about possible dependence
between $y_t$ and $x_t$. In this paper we consider mainly the case $X = [0,1]^n$
for some $n$ and $Y = [0,1]$. Many of the results have obvious extensions to
larger ranges of real inputs and outputs. We sometimes also consider the
special case $Y = \{0,1\}$ where the outputs (but not the inputs) are required
to be discrete.

To judge the quality of the predictions, we first introduce a loss function $L$
that gives a (nonnegative) quantity $L(y_t, \hat{y}_t)$ as a measure of
discrepancy between the prediction and actual outcome. The square loss given
by $L(y, \hat{y}) = (y - \hat{y})^2$ is a good example of a loss function suitable
for our setting.

In addition to the loss function, it is essential to give a comparison class
$\mathcal{F}$ of predictors as a reference point. The predictors are mappings from
the set of possible inputs $X$ to the set of possible predictions. We then
define the total loss for an algorithm $A$ that gives the predictions
$\hat{y}_t$ on a sequence $S = ((x_1, y_1), \ldots, (x_\ell, y_\ell))$ as
$\mathrm{Loss}_A(S) = \sum_{t=1}^{\ell} L(y_t, \hat{y}_t)$, and similarly for a
predictor $f \in \mathcal{F}$ as $\mathrm{Loss}_f(S) = \sum_{t=1}^{\ell} L(y_t, f(x_t))$.
We can measure the performance of our prediction algorithm by considering the
additional loss $\mathrm{Loss}_A(S) - \inf_{f \in \mathcal{F}} \mathrm{Loss}_f(S)$ it incurs
compared to the best fixed predictor from the comparison class. We call such
performance bounds relative loss bounds.

In the extreme case that the outcomes $y_t$ are completely random, the algorithm
obviously cannot perform better than random guessing, but then neither can the
predictors from the comparison class, so the additional loss can still be made
small. In the more interesting extreme case that one predictor
$f \in \mathcal{F}$ is perfect and we have $L(y_t, f(x_t)) = 0$ for all $t$, the
algorithm can still be allowed some initial interval of bad predictions, but to
achieve a small additional loss it needs to quickly learn to make good
predictions. Usually we are somewhere between these two extremes. Some
predictors from the comparison class predict better than others, and the
algorithm is required to perform roughly as well as the better ones.

In this paper the comparison classes we use come from the framework of
predicting with expert advice [Vov90, CBFH+97]. We assume there are $n$ experts,
and the prediction of the $i$th expert for the $t$th outcome is given by
$x_{t,i} \in [0,1]$. The vector $x_t$ of all the experts' predictions at trial $t$
is then the $t$th input vector to our algorithm. Hence, if we define
$\mathcal{E}_i(x) = x_i$, then $\mathrm{Loss}_{\mathcal{E}_i}(S)$ denotes the loss that the
expert $\mathcal{E}_i$ would incur on the sequence $S$. The obvious thing to do now
is to take as comparison class the set $\{\mathcal{E}_1, \ldots, \mathcal{E}_n\}$ of
expert predictors and thus compare the loss of the algorithm to the loss
$\min_i \mathrm{Loss}_{\mathcal{E}_i}(S)$ of the best single expert.

Earlier work on the expert framework by Vovk [Vov90] has shown that for a
very general class of loss functions his Aggregating Algorithm (AA) achieves the
bound

$$\mathrm{Loss}_{AA}(S) \le \mathrm{Loss}_{\mathcal{E}_i}(S) + c_L \ln n \quad \text{for all } i \qquad (1)$$

where the constant $c_L$ depends on the loss function. For example, with the
square loss we have $c_L = 1/2$. This bound has also been shown to be
essentially optimal [HKW98]. (Notice that for the important special case of
absolute loss $L(y, \hat{y}) = |y - \hat{y}|$, only bounds of a somewhat weaker
form are possible [LW94, Vov90, CBFH+97].) Vovk's Aggregating Algorithm is based
on maintaining for each expert a weight that is decreased exponentially as the
expert incurs loss. The predictions of the algorithm are of course affected
more by the experts with large weights than by those with small weights, but
the actual method of obtaining the prediction is somewhat more complicated than
just taking a weighted average of the experts' predictions.

The main technical novelty in this paper is considering what happens if we keep
using Vovk's algorithm for maintaining the weights but replace the prediction
simply by the weighted average of the experts' predictions. Considering the
optimality of Vovk's algorithm, we cannot hope to outperform it, but it turns
out that for the simplified Weighted Average Algorithm (WAA) we can still prove
the bound

$$\mathrm{Loss}_{WAA}(S) \le \mathrm{Loss}_{\mathcal{E}_i}(S) + \tilde{c}_L \ln n \quad \text{for all } i \qquad (2)$$

where $\tilde{c}_L$ is a constant somewhat greater than $c_L$ in (1). For example,
with the square loss we have $\tilde{c}_L = 2$ and $c_L = 1/2$.
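
A minimal sketch of a weighted-average predictor for the square loss is given
below. It assumes the exponential update
$v_{t+1,i} \propto v_{t,i} \exp(-L(y_t, x_{t,i})/c)$ with $c = 2$ and uniform
initial weights; the exact rate is an assumption made here to match the
constant quoted above for the square loss, and the function and variable names
are illustrative only.

```python
import math
import random

def square_loss(y, y_hat):
    return (y - y_hat) ** 2

def run_waa(expert_preds, outcomes, c=2.0):
    """Predict with the weighted average; decay weights exponentially in loss.

    expert_preds[t][i] is expert i's prediction at trial t (in [0, 1]).
    Returns the algorithm's total loss and the list of experts' total losses.
    """
    n = len(expert_preds[0])
    weights = [1.0 / n] * n            # uniform initial weights (assumed)
    alg_loss = 0.0
    expert_loss = [0.0] * n
    for x, y in zip(expert_preds, outcomes):
        y_hat = sum(v * xi for v, xi in zip(weights, x))
        alg_loss += square_loss(y, y_hat)
        for i, xi in enumerate(x):
            expert_loss[i] += square_loss(y, xi)
        unnorm = [v * math.exp(-square_loss(y, xi) / c)
                  for v, xi in zip(weights, x)]
        total = sum(unnorm)
        weights = [u / total for u in unnorm]
    return alg_loss, expert_loss

def additional_loss(n=5, trials=200, seed=0):
    """Loss of the sketch minus the loss of the best expert on random data."""
    rng = random.Random(seed)
    preds = [[rng.random() for _ in range(n)] for _ in range(trials)]
    outcomes = [rng.random() for _ in range(trials)]
    alg_loss, expert_loss = run_waa(preds, outcomes)
    return alg_loss - min(expert_loss)
```

On data in $[0,1]$ the value returned by `additional_loss(n=5)` can be compared
with $2 \ln 5$, the right-hand side slack of bound (2) for the square loss.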

The main reason why we want to consider the simplified prediction at the cost of
slightly larger additional loss is that the simplified algorithm leads to
simplified proofs of the relative loss bounds. Another intuitively appealing
aspect of the weighted average as prediction is its probabilistic
interpretation. If the negated loss $-L(y_t, x_{t,i})$ can be interpreted as the
log likelihood of $y_t$ given model $\mathcal{E}_i$, then the weight of the expert
$\mathcal{E}_i$ after the trials can be interpreted as the posterior probability
assigned to that expert. The prior probabilities here are the initial weights
of the experts. In this setting, the prediction by weighted average
corresponds to the mean posterior prediction. The log loss, for which the log
likelihood interpretation is most obvious, has been analyzed in this context be- 
fore |Vov9(IChiI’H+97b'^^W5/| . It turns out that in the special case of log loss, 
the prediction of the Aggregating Algorithm is also the weighted average, so the
Weighted Average Algorithm coincides with the original Aggregating Algorithm.

In reducing the algorithm's dependence on the particular loss function, the next
step would be Freund and Schapire's Hedge Algorithm that needs to assume only
that the loss function has a bounded range. They can still prove loss bounds of
the same flavor as the bounds here, but in the slightly weaker form of

$$\mathrm{Loss}_{Hedge}(S) \le \mathrm{Loss}_{\mathcal{E}_i}(S) + a \sqrt{\mathrm{Loss}_{\mathcal{E}_i}(S) \ln n} + b \ln n \quad \text{for all } i$$

for certain $a, b > 0$. Hence, there is a progression of algorithms where Vovk's
original Aggregating Algorithm has a weight update that is uniform for all kinds
of loss functions, but the prediction method is dependent on $L$. For the
Weighted Average Algorithm, the prediction is made by the weighted average
regardless of the loss function, but this happens at the cost of slightly worse
constants in the loss bounds. Finally, the Hedge Algorithm is even more uniform
in its treatment of loss functions, but the loss bounds get worse by more than
just a constant. (Also notice that the bound for the Hedge Algorithm does not
work with the unbounded log loss.)

After the technical remarks, consider now relating these results to a larger
body of work where the relative entropy is the fundamental concept for
motivating and analyzing learning algorithms [KW97]. Let $u \in R^n$ and
$v \in R^n$ be probability vectors; i.e., $\sum_i u_i = \sum_i v_i = 1$ and
$u_i, v_i \ge 0$ for all $i$. The relative entropy between $u$ and $v$ is then
$d_{re}(u, v) = \sum_{i=1}^{n} u_i \ln(u_i/v_i)$. To introduce relative entropy
methods into the present problem, it is useful to start by considering a
slightly extended comparison class. We define
$\mathrm{Loss}^{ave}_u(S) = \sum_{t=1}^{\ell} \sum_{i=1}^{n} u_i L(y_t, x_{t,i})$ to be
the expected loss if we predict by a random expert chosen according to $u$. We
first rewrite Vovk's original proof in order to bring out how the additional
loss incurred by the algorithm relates to a relative entropy. The resulting
bound is

$$\mathrm{Loss}_{WAA}(S) \le \mathrm{Loss}^{ave}_u(S) + \tilde{c}_L\, d_{re}(u, v_1) \qquad (3)$$

where $v_1$ is the algorithm's initial weight vector. With
$v_1 = (1/n, \ldots, 1/n)$, and $u_i = 1$ and $u_j = 0$ for $j \neq i$, this
simplifies to bound (2) where comparison is against the single best expert
$\mathcal{E}_i$. Note that since always
$\mathrm{Loss}^{ave}_u(S) \ge \min_i \mathrm{Loss}_{\mathcal{E}_i}(S)$, going from (2) to (3)
does not bring any improvement in the first term of the bound. However,
improvements in the second term are possible. If there are several experts
with nearly optimal performance, then substituting into (3) a comparison vector
$u$ that distributes the weight nearly evenly among the good experts gives a
significantly sharper bound than (2). As a simple example, assume that $k$
experts all have some small loss $Q$. Then (2) gives the loss bound
$Q + \tilde{c}_L \ln n$ while the bound (3) goes down to
$Q + \tilde{c}_L \ln(n/k)$. The new method brings out in a more explicit form the
feature implicit in earlier proofs (see, e.g., [LW94, Vov90]) that having more
than one good expert results in a smaller additional loss. For log loss this
feature, with bounds of the form (3) and proofs analogous to ours, was already
pointed out in [FSSW97].
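
The drop from $\ln n$ to $\ln(n/k)$ in the example with $k$ good experts is a
one-line computation of the relative entropy between a comparison vector that
is uniform on $k$ experts and the uniform initial weight vector. The sketch
below checks it numerically; the helper names are illustrative.

```python
import math

def relative_entropy(u, v):
    """d_re(u, v) = sum_i u_i ln(u_i / v_i), with 0 ln 0 taken as 0."""
    return sum(ui * math.log(ui / vi) for ui, vi in zip(u, v) if ui > 0)

def uniform_on(indices, n):
    """Probability vector that spreads its mass evenly over `indices`."""
    chosen = set(indices)
    return [1.0 / len(chosen) if i in chosen else 0.0 for i in range(n)]

n, k = 16, 4
v1 = [1.0 / n] * n                      # uniform initial weights
u_single = uniform_on([0], n)           # all mass on one expert
u_spread = uniform_on(range(k), n)      # mass spread over k good experts

d_single = relative_entropy(u_single, v1)   # equals ln n
d_spread = relative_entropy(u_spread, v1)   # equals ln(n / k)
```

With these values the second term of the bound shrinks by $\ln k$ when the
comparison vector is spread over the $k$ good experts.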

Our second use for relative entropy is as a regularizing term in setting up a 
minimization problem that gives Vovk’s rule for updating the weights. The basic 
idea in such a derivation (see [KW97, HKW95] for other examples) is to see the
update as an act of balancing the need to maintain old information by staying 
close to the old weight vector and the need to learn by moving the weights in 
the direction of small loss on the last example. 

In Sect. 2 we review the basic expert framework and Vovk's algorithm. Sect. 3
gives the new upper bound for the additional loss achieved by the modified
algorithm that predicts with the weighted combination of experts. A
straightforward proof is given in Sect. 4. In Sect. 5 we restate the bound and
proof using a relative entropy, and give a motivation for the algorithm in
terms of a relative entropy minimization problem. Finally, in Sect. 6 we
generalize the relative loss bounds for the new algorithm to multi-dimensional
predictions and outcomes.

2 The Setting and the Algorithm



We consider a simple on-line prediction setting, where learning takes place
during a sequence of trials. At trial $t$, the learner tries to predict a
real-valued outcome $y_t$. The learner's prediction is denoted by $\hat{y}_t$, and
the performance of the learner is measured by using a loss function $L$. Loss
functions will be discussed in more detail in Sect. 3, but for understanding
the algorithm it is sufficient to think of, say, the square loss given by
$L(y, \hat{y}) = (y - \hat{y})^2$. The learner bases its prediction $\hat{y}_t$ on an
instance $x_t$. In the expert-based framework we use here, we imagine there is
a set of experts $\mathcal{E}_i$, $i = 1, \ldots, n$, and the instance $x_t$ is an
$n$-dimensional vector where the $i$th component $x_{t,i}$ of the $t$th instance
can be interpreted as the prediction given by expert $\mathcal{E}_i$ for the outcome
$y_t$.

We consider here a specific kind of algorithm based on maintaining a weight
on each expert. The weight vector $v_t$ is normalized to be a probability
vector (i.e., $\sum_i v_{t,i} = 1$, $v_{t,i} \ge 0$), and $v_{t,i}$ can be interpreted as
the algorithm's belief in the expert $\mathcal{E}_i$ having the best prediction at the
trial $t$. The prediction of the algorithm at trial $t$ is given by the weighted
average $\hat{y}_t = v_t \cdot x_t$. After seeing the outcome $y_t$, the algorithm
updates its weights. The update method and all other details of the Weighted
Average Algorithm (WAA) we consider here are given in Figure 1.
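
The description above translates almost directly into code. The following is a minimal sketch of one possible implementation, not the authors' code; the function names `square_loss` and `waa_run`, the toy data, and the choice `c = 2.0` (the value the paper's Table 2 lists for the square loss on $[0,1]$) are assumptions made for illustration.

```python
import math

def square_loss(y, y_hat):
    return (y - y_hat) ** 2

def waa_run(instances, outcomes, c, loss=square_loss, v1=None):
    """Run the Weighted Average Algorithm on a sequence of trials.

    instances[t][i] is expert i's prediction at trial t.
    Returns (algorithm's total loss, per-expert total losses, final weights).
    """
    n = len(instances[0])
    v = list(v1) if v1 is not None else [1.0 / n] * n   # initial probability vector
    total_loss = 0.0
    expert_losses = [0.0] * n
    for x, y in zip(instances, outcomes):
        y_hat = sum(vi * xi for vi, xi in zip(v, x))    # weighted-average prediction
        total_loss += loss(y, y_hat)
        losses = [loss(y, xi) for xi in x]              # losses of the individual experts
        expert_losses = [a + b for a, b in zip(expert_losses, losses)]
        unnorm = [vi * math.exp(-li / c) for vi, li in zip(v, losses)]  # loss update
        norm = sum(unnorm)
        v = [w / norm for w in unnorm]
    return total_loss, expert_losses, v

# Toy data (assumption): three experts, outcomes in [0, 1].
instances = [[0.1, 0.5, 0.9], [0.2, 0.5, 0.8], [0.9, 0.5, 0.1], [0.3, 0.5, 0.7]] * 5
outcomes = [0.0, 0.0, 1.0, 0.0] * 5
total, per_expert, final_v = waa_run(instances, outcomes, c=2.0)
```

On this data the weights concentrate on the first expert, whose accumulated loss is smallest, and the algorithm's total loss stays within $c \ln n$ of that expert's loss.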

Sometimes it is more convenient to express the update in terms of the
unnormalized weights

$$w_{t,i} = w_{1,i} \exp\Bigl(-\frac{1}{c} \sum_{j=1}^{t-1} L(y_j, x_{j,i})\Bigr) \qquad (4)$$

where $w_{1,i} = v_{1,i}$. Now $v_{t,i} = w_{t,i}/W_t$ where $W_t = \sum_{i=1}^n w_{t,i}$ is the
normalization factor. Thus, ignoring the normalization factor, the logarithm of
the weight of an expert is proportional to the expert's accumulated loss from
preceding trials. We call this the loss update to emphasize that only the
values of the loss function (and not its gradient etc.) are used.

The loss update of the Weighted Average Algorithm was introduced by Vovk
[Vov90] in his Aggregating Algorithm (AA) that generalized the Weighted
Majority algorithm [LW94]. However, the prediction of the Aggregating Algorithm
is usually given by a function that is non-linear in $v_t$ and depends on the
loss function. In contrast, we use the fixed prediction function
$\hat{y}_t = v_t \cdot x_t$ for all loss functions. (A notable special case is the log
loss, for which the Aggregating Algorithm also predicts with
$\hat{y}_t = v_t \cdot x_t$.)



3 Basic Loss Bounds 

We begin with a short discussion of some basic properties of loss functions. The
definitions of the loss functions most interesting to us are given in Table 1.
For a loss function $L$, we define $L_y(\hat{y}) = L(y, \hat{y})$ for convenience in
writing derivatives with respect to $\hat{y}$. Note that with the exception of the
absolute loss, all the loss
Initialize the weights to some probability vector $v_{1,i}$;
set the parameter $c$ to some positive value.

Repeat for $t = 1, \ldots$

1. Receive the instance $x_t$.

2. Output the prediction $\hat{y}_t = v_t \cdot x_t$.

3. Receive the outcome $y_t$.

4. Update the weights by the loss update defined as follows:

   $v_{t+1,i} = v_{t,i} \exp(-L(y_t, x_{t,i})/c) / \mathrm{norm}_t$

   where

   $\mathrm{norm}_t = \sum_{i=1}^n v_{t,i} \exp(-L(y_t, x_{t,i})/c)$.

Fig. 1. The Weighted Average Algorithm (WAA) for combining expert predictions



functions given in Table 1 are convex, i.e., $L''_y(x) > 0$ for all $x$ and $y$, and
also satisfy $L'_y(y) = 0$ for $0 < y < 1$. This implies monotonicity, i.e.,
$L'_y(x) \le 0$ for $x < y$ and $L'_y(x) \ge 0$ for $x > y$. We generalize the derivative
notation also for the end points by defining $L'_0(0) = L'_1(1) = 0$. The absolute
loss $L(y, \hat{y}) = |y - \hat{y}|$ (and other loss functions that are not continuously
differentiable) is not covered by the bounds given in this paper.

Given some fixed loss function $L$, consider now the total loss

$$\mathrm{Loss}_A(S) = \sum_{t=1}^{\ell} L(y_t, \hat{y}_t)$$

suffered by some algorithm $A$ on the trial sequence with the instance-outcome
pairs $S = ((x_1, y_1), \ldots, (x_\ell, y_\ell))$. We wish to prove upper bounds for this
total loss without making statistical or other assumptions about how the
instances and outcomes are generated. When no such assumptions are made, one
suitable way of measuring the quality of the learner's predictions is to
compare it against the losses incurred by the individual experts on the same
sequence. Thus, we also define
$\mathrm{Loss}_{\mathcal{E}_i}(S) = \sum_{t=1}^{\ell} L(y_t, x_{t,i})$.

Consider first the known bounds for the Aggregating Algorithm, which uses the
same weights $v_t$ as the algorithm of Figure 1 but a different prediction
$\hat{y}_t$. To state the optimal constants in the bounds, and the learning rates
that lead to them, define for $z, p, q \in [0,1]$ (where $z$ should be interpreted
as a "prediction" and $p$ and $q$ as two possible "outcomes") the ratio

$$R(z,p,q) = \frac{L'_p(z) L'_q(z)^2 - L'_q(z) L'_p(z)^2}{L'_p(z) L''_q(z) - L'_q(z) L''_p(z)} ;$$

we define $R(z,p,q) = 0$ in the special case $p = q$. Let further

$$c_L = \sup_{0 \le z,p,q \le 1} R(z,p,q) .$$
Table 1. Some common loss functions for the domain [0,1] x [0,1]

loss function L          value for $L(y, \hat{y})$
square loss              $(y - \hat{y})^2$
relative entropy loss    $(1 - y) \ln((1 - y)/(1 - \hat{y})) + y \ln(y/\hat{y})$
Hellinger loss           $\frac{1}{2}\bigl((\sqrt{1-y} - \sqrt{1-\hat{y}})^2 + (\sqrt{y} - \sqrt{\hat{y}})^2\bigr)$
absolute loss            $|y - \hat{y}|$

The bound for the Aggregating Algorithm originally given by Vovk [Vov90] can
now be stated as follows.

Theorem 1. Let $L$ be a convex monotone twice differentiable loss function and
AA be the Aggregating Algorithm with $c \ge c_L$ and initial weights $w_{1,i} = 1$. Then
for any sequence $S = ((x_1, y_1), \ldots, (x_\ell, y_\ell))$ we have

$$\mathrm{Loss}_{\mathrm{AA}}(S) \le \Bigl(\min_i \mathrm{Loss}_{\mathcal{E}_i}(S)\Bigr) + c \ln n . \qquad (5)$$



The Aggregating Algorithm was also considered by Haussler et al. [HKW98],
who showed the bound (5) optimal in the sense that under some reasonable
regularity conditions, for any on-line algorithm $A$ there are sequences $S$ such
that

$$\mathrm{Loss}_A(S) \ge \Bigl(\min_i \mathrm{Loss}_{\mathcal{E}_i}(S)\Bigr) + c_L \ln n - o(1) ,$$

where $o(1)$ approaches 0 as $n$ and $\ell$ approach $\infty$ in a suitable manner.

Vovk and Haussler et al. were mainly interested in the binary case
$y_t \in \{0,1\}$ and actually state (5) only for that case in the form

$$\mathrm{Loss}_{\mathrm{AA}}(S) \le \Bigl(\min_i \mathrm{Loss}_{\mathcal{E}_i}(S)\Bigr) + c_{L,\mathrm{bin}} \ln n \qquad (6)$$

where $c_{L,\mathrm{bin}} = \sup_z R(z, 0, 1)$. The actual proof of Theorem 1 is a simple
generalization of the earlier proofs [Vov90, HKW98], so we omit it here.
Haussler et al. also use some special techniques to show that for certain loss
functions such as the square loss and the relative entropy loss the bound (6)
holds even when $y_t$ is allowed to range over the whole interval [0,1]. (The
value of the constant for Hellinger loss for continuous-valued outcomes was
left open in [HKW98].) The new formulation of Theorem 1 gives a unified method
of obtaining bounds in the continuous-valued case. For square, relative
entropy, and Hellinger loss a straightforward proof (omitted) shows that we
actually have $c_L = c_{L,\mathrm{bin}}$, so the bound is the same for continuous-valued
and binary outcomes.

The main content of the bound (5) is that even for a large number of experts,
the loss of the algorithm exceeds the loss of the best expert only by a small 
additive constant, regardless of the number of trials. Thus, the algorithm is 
good at weeding out the bad experts and then following the good ones. We can 
Table 2. Comparison of the constants in bounds (5) and (9) for various loss
functions.

loss function L      $c_L$        $\tilde{c}_L$
relative entropy     1            1
square               1/2          2
Hellinger            ≈ 0.71       1

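prove a similar bound for the Weighted Average Algorithm that predicts with
$\hat{y}_t = v_t \cdot x_t$. Define

$$\tilde{R}(z,p) = \frac{L'_p(z)^2}{L''_p(z)} \qquad (7)$$

and

$$\tilde{c}_L = \sup_{0 \le z,p \le 1} \tilde{R}(z,p) . \qquad (8)$$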

We can now state the bound for WAA as follows.

Theorem 2. Let $L$ be a monotone convex twice differentiable loss function and
WAA be the Weighted Average Algorithm of Figure 1 with uniform initial weights
$w_{1,i} = 1$ and with $c \ge \tilde{c}_L$. Then for any sequence
$S = ((x_1, y_1), \ldots, (x_\ell, y_\ell))$ we have

$$\mathrm{Loss}_{\mathrm{WAA}}(S) \le \Bigl(\min_i \mathrm{Loss}_{\mathcal{E}_i}(S)\Bigr) + c \ln n . \qquad (9)$$

A generalization for multi-dimensional predictions and outcomes is given in
Sect. 6.

To compare the constants $c_L$ and $\tilde{c}_L$ in (5) and (9), respectively, recall
that $(a + b)/(a' + b') \le \max\{a/a', b/b'\}$ for any $a, a', b, b' > 0$. From this it
is immediate that $c_L \le \tilde{c}_L$. For the most usual cases (9) is strictly worse
than (5), as can be seen from the comparison in Table 2. For the relative
entropy loss the bounds are actually equal, which is no surprise since then
also the algorithms are the same (i.e., the Aggregating Algorithm also
predicts with $\hat{y}_t = v_t \cdot x_t$).
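
The square-loss constants can be reproduced with a small grid search. This is a sketch under stated assumptions: the derivatives are written out by hand for $L_p(z) = (p - z)^2$, the ratio for the Aggregating Algorithm is evaluated only at the binary outcomes $p = 0$, $q = 1$, the second ratio is taken as the squared first derivative over the second derivative (the quantity that appears in the concavity condition of Sect. 4), and the 101-point grid is an arbitrary choice.

```python
def dL(p, z):
    """First derivative of the square loss L_p(z) = (p - z)^2 with respect to z."""
    return 2.0 * (z - p)

D2L = 2.0  # second derivative of the square loss (constant)

def R_tilde(z, p):
    """Ratio for the Weighted Average Algorithm: L'_p(z)^2 / L''_p(z)."""
    return dL(p, z) ** 2 / D2L

def R(z, p, q):
    """Ratio for the Aggregating Algorithm; defined as 0 when p == q."""
    if p == q:
        return 0.0
    num = dL(p, z) * dL(q, z) ** 2 - dL(q, z) * dL(p, z) ** 2
    den = dL(p, z) * D2L - dL(q, z) * D2L
    return num / den

grid = [i / 100 for i in range(101)]                       # grid over [0, 1]
c_tilde = max(R_tilde(z, p) for z in grid for p in grid)   # WAA constant
c_bin = max(R(z, 0.0, 1.0) for z in grid)                  # binary AA constant
```

For the square loss this gives a WAA constant four times the binary Aggregating Algorithm constant, in line with the square-loss row of Table 2.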



4 The Basic Upper Bound Proof 

We apply to our situation the potential function method commonly used in
computer science to analyze on-line algorithms. Thus, we introduce a potential
$P$, with the value $P_t$ describing the algorithm's state just prior to trial $t$.
Then $P_t - P_{t+1}$ is the decrease in the potential due to trial $t$. The key in
proving the loss bound for an algorithm $A$ is to show for each trial $t$ that the
prediction $\hat{y}_t$ of $A$ satisfies

$$L(y_t, \hat{y}_t) \le P_t - P_{t+1} , \qquad (10)$$

from which summing over $t = 1, \ldots, \ell$ yields $\mathrm{Loss}_A(S) \le P_1 - P_{\ell+1}$. That
is, the total loss of the algorithm is bounded by the total decrease in
potential. The
basic question now is how to choose the potential $P$ such that the inequality
(10) can be satisfied by a suitable choice of the prediction $\hat{y}_t$, and the
total decrease of the potential gives interesting loss bounds. This question
was originally answered for general loss functions by Vovk [Vov90], who
generalized the potential used in [LW94] for the absolute loss. We shall next
review Vovk's method for obtaining total loss bounds from (10) using our
notation and then show how (10) can be achieved by the prediction
$\hat{y}_t = v_t \cdot x_t$ with slightly worse constants than with Vovk's original
prediction.

First, recall from Sect. 2 that our algorithm has at trial $t$ an $n$-dimensional
weight vector $w_t$ defined in (4), and we write $W_t = \sum_{i=1}^n w_{t,i}$. As our
potential we now choose

$$P_t = c \ln W_t \qquad (11)$$

where $c > 0$ is the same constant that is used in the updates. As it turns
out, multiplying the weights by a constant affects neither the algorithm nor our
analysis of it. Regarding the potentials in particular, multiplying the weights
by a positive constant $a$ translates into adding the constant $c \ln a$ to the
potential, which leaves potential differences unaffected. Thus, without loss of
generality we can scale the initial weights so that $W_1 = 1$ holds, and $P_1 = 0$.

Elaborating further on our loss bound we get

$$\begin{aligned}
\mathrm{Loss}_A(S) &\le P_1 - P_{\ell+1} \\
&= -c \ln \sum_{i=1}^n w_{1,i} \exp(-\mathrm{Loss}_{\mathcal{E}_i}(S)/c) \\
&\le -c \ln \bigl(w_{1,i} \exp(-\mathrm{Loss}_{\mathcal{E}_i}(S)/c)\bigr) \\
&= \mathrm{Loss}_{\mathcal{E}_i}(S) - c \ln w_{1,i}
\end{aligned}$$

for any given expert $i$. In particular, in the absence of any other preference it
seems natural to set all the initial weights equal, which gives $w_{1,i} = 1/n$ for
all $i$ and thus results in the final bound

$$\mathrm{Loss}_A(S) \le \Bigl(\min_i \mathrm{Loss}_{\mathcal{E}_i}(S)\Bigr) + c \ln n . \qquad (12)$$

To prove Theorem 2 it thus remains to show that (10) is satisfied for the
Weighted Average Algorithm. This turns out to be true for all $y_t$ and $x_t$
exactly when the constant $c$ satisfies the condition of the theorem.

To prove (10), first write the potential difference in the form

$$P_t - P_{t+1} = -c \ln \frac{W_{t+1}}{W_t} = -c \ln \sum_{i=1}^n v_{t,i} \exp(-L(y_t, x_{t,i})/c)$$

where $v_{t,i} = w_{t,i}/W_t$ is the normalized $i$th weight. We use the normalized
weight vector in the prediction by choosing $\hat{y}_t = v_t \cdot x_t$. Then (10) becomes

$$L(y_t, v_t \cdot x_t) \le -c \ln \sum_{i=1}^n v_{t,i} \exp(-L(y_t, x_{t,i})/c) ,$$

or equivalently

$$\exp(-L(y_t, v_t \cdot x_t)/c) \ge \sum_{i=1}^n v_{t,i} \exp(-L(y_t, x_{t,i})/c) .$$

If we define $f_y(x) = \exp(-L(y, x)/c)$, (10) therefore is equivalent with

$$f_{y_t}\Bigl(\sum_{i=1}^n v_{t,i} x_{t,i}\Bigr) \ge \sum_{i=1}^n v_{t,i} f_{y_t}(x_{t,i}) .$$

Since $v_t$ is a probability vector, this holds by Jensen's inequality if $f_{y_t}$ is
concave. Using the notation $L_y(x) = L(y, x)$, we have

$$f'_y(x) = (-L'_y(x)/c) \exp(-L_y(x)/c)$$

and

$$f''_y(x) = \bigl((L'_y(x)/c)^2 - L''_y(x)/c\bigr) \exp(-L_y(x)/c) .$$

Hence, since we assume $L''_y(x)$ to be positive, $f''_y(x) \le 0$ holds if and only if
$c \ge L'_y(x)^2 / L''_y(x)$. Therefore, (10) holds for the prediction
$\hat{y}_t = v_t \cdot x_t$ if the constant $c$ satisfies



$$c \ge \frac{L'_{y_t}(x_{t,i})^2}{L''_{y_t}(x_{t,i})} \quad \text{for } i = 1, \ldots, n .$$



This concludes the proof of Theorem 2.

The result can be generalized to multi-dimensional predictions, as we see in
Sect. 6.
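
The Jensen step of the proof can be sanity-checked numerically for the square loss on $[0,1]$. The following is a sketch only: the random test cases, the seed, and the helper names `f` and `jensen_gap` are assumptions, and a passing check on random samples illustrates rather than proves the concavity argument.

```python
import math
import random

def f(y, x, c):
    """f_y(x) = exp(-L(y, x)/c) for the square loss."""
    return math.exp(-((y - x) ** 2) / c)

def jensen_gap(y, xs, vs, c):
    """f_y(v . x) - sum_i v_i f_y(x_i); non-negative when f_y is concave."""
    mix = sum(v * x for v, x in zip(vs, xs))
    return f(y, mix, c) - sum(v * f(y, x, c) for v, x in zip(vs, xs))

rng = random.Random(0)

def random_case(n=5):
    y = rng.random()
    xs = [rng.random() for _ in range(n)]
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return y, xs, [r / total for r in raw]

cases = [random_case() for _ in range(2000)]
min_gap_c2 = min(jensen_gap(y, xs, vs, 2.0) for y, xs, vs in cases)

# With c far too small the inequality can fail, e.g. two experts at 0 and 1.
gap_small_c = jensen_gap(0.0, [0.0, 1.0], [0.5, 0.5], 0.1)
```

With `c = 2.0` the gap stays non-negative on all sampled cases, while the hand-picked case with `c = 0.1` violates the inequality, which shows why a lower limit on $c$ is needed.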



5 Bounds Based on the Relative Entropy 

We now wish to consider bounds in which the loss of the algorithm is compared
not to the loss of the best single expert, but the loss of the best probabilistic
combination of the experts. In particular, assume that at trial $t$ we predict
according to the prediction of an expert chosen at random, with expert $\mathcal{E}_i$
having probability $u_i$ of being chosen. For such probabilistic predictions, the
expected loss over the whole sequence is given by

$$\mathrm{Loss}^{\mathrm{avg}}_u(S) = \sum_{i=1}^n u_i \mathrm{Loss}_{\mathcal{E}_i}(S) = \sum_{t=1}^{\ell} u \cdot L_t ,$$

where $L_t$ denotes the vector of losses of the experts at trial $t$, i.e.,
$L_{t,i} = L(y_t, x_{t,i})$.

As discussed in the introduction, we wish to bound the loss of the algorithm in
terms of the average loss $\mathrm{Loss}^{\mathrm{avg}}_u(S)$ and the distance $d(u, v_1)$ between
$u$ and the algorithm's initial weight vector $v_1$ for some natural distance
function $d$. For both the Aggregating Algorithm and the Weighted Average
Algorithm, the most suitable distance is the relative entropy given by
$d_{\mathrm{re}}(u,v) = \sum_{i=1}^n u_i \ln(u_i/v_i)$.
Our bound is then as follows. 
Theorem 3. Let $L$ be a monotone convex twice differentiable loss function,
and let the Weighted Average Algorithm WAA use arbitrary initial weights $v_1$
and parameter $c = \tilde{c}_L$, where $\tilde{c}_L$ is as in (8). Then for any sequence
$S = ((x_1, y_1), \ldots, (x_\ell, y_\ell))$ and for all probability vectors $u$ we have

$$\mathrm{Loss}_{\mathrm{WAA}}(S) \le \mathrm{Loss}^{\mathrm{avg}}_u(S) + \tilde{c}_L\, d_{\mathrm{re}}(u, v_1) . \qquad (13)$$

It is easy to see that also in Vovk's original analysis one can use the distance
$d_{\mathrm{re}}(u, v_1)$ as done in the above bound. As a result one gets for the
Aggregating Algorithm a bound like (13) with $c_L$ instead of $\tilde{c}_L$.

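Proof of Theorem 3. We express the progress towards the reference vector $u$
as follows:

$$\begin{aligned}
d_{\mathrm{re}}(u, v_t) - d_{\mathrm{re}}(u, v_{t+1}) &= \sum_{i=1}^n u_i \ln \frac{v_{t+1,i}}{v_{t,i}} \\
&= \sum_{i=1}^n u_i \ln \frac{w_{t+1,i} W_t}{w_{t,i} W_{t+1}} \\
&= -u \cdot L_t/c + \sum_{i=1}^n u_i \ln \frac{W_t}{W_{t+1}} \\
&= -u \cdot L_t/c + (P_t - P_{t+1})/c . \qquad (14)
\end{aligned}$$

Applying (10) now yields

$$L(y_t, \hat{y}_t) \le P_t - P_{t+1} = u \cdot L_t + c\,\bigl(d_{\mathrm{re}}(u, v_t) - d_{\mathrm{re}}(u, v_{t+1})\bigr) .$$

Summing over all the trials we obtain

$$\mathrm{Loss}_{\mathrm{WAA}}(S) \le P_1 - P_{\ell+1} = \mathrm{Loss}^{\mathrm{avg}}_u(S) + c\,\bigl(d_{\mathrm{re}}(u, v_1) - d_{\mathrm{re}}(u, v_{\ell+1})\bigr) . \qquad (15)$$

Omitting the non-negative distance $d_{\mathrm{re}}(u, v_{\ell+1})$ gives the bound (13) of the
theorem. □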



To see some interesting details of the proof, notice that in (14) the probability
vector $u$ is arbitrary. So in particular we can choose $u = v_t$ and thus obtain

$$-d_{\mathrm{re}}(v_t, v_{t+1}) = -v_t \cdot L_t/c + (P_t - P_{t+1})/c . \qquad (16)$$

Combining (14) and (16) gives us the following fundamental connection between
distances and average losses:

$$v_t \cdot L_t = u \cdot L_t + c\,\bigl(d_{\mathrm{re}}(u, v_t) - d_{\mathrm{re}}(u, v_{t+1}) + d_{\mathrm{re}}(v_t, v_{t+1})\bigr) .$$
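
The one-step identities above are easy to confirm numerically. The sketch below draws a random weight vector, comparison vector, and loss vector (all illustrative assumptions, as are the helper names) and checks both the progress identity and the combined decomposition for a single loss update.

```python
import math
import random

def d_re(u, v):
    """Relative entropy between probability vectors with positive entries in v."""
    return sum(ui * math.log(ui / vi) for ui, vi in zip(u, v) if ui > 0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(1)

def rand_prob(n):
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

n, c = 6, 2.0                                   # illustrative values (assumption)
u, v_t = rand_prob(n), rand_prob(n)
L_t = [rng.random() for _ in range(n)]          # losses of the experts at trial t

unnorm = [v * math.exp(-l / c) for v, l in zip(v_t, L_t)]
norm = sum(unnorm)
v_next = [w / norm for w in unnorm]             # loss update
potential_drop = -c * math.log(norm)            # P_t - P_{t+1}

progress = d_re(u, v_t) - d_re(u, v_next)
progress_rhs = (-dot(u, L_t) + potential_drop) / c

avg_loss = dot(v_t, L_t)
avg_loss_rhs = dot(u, L_t) + c * (progress + d_re(v_t, v_next))
```

Both pairs agree up to floating-point error, for any choice of the comparison vector `u`.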

We conclude this section by pointing out a strong relationship between the
update of the algorithm and the bound (13). One can show that the probability
vector $u$ that minimizes the right-hand side of the bound (13) is $v_{\ell+1}$. With
this minimizer $u = v_{\ell+1}$, the value of the bound equals $P_1 - P_{\ell+1}$ (which is
the constant value of the right-hand side of (15)). Thus, the weight vector
$v_{t+1}$ produced by the loss update at the end of trial $t$ is the minimizer of
the bound (13) with respect to the first $t$ examples, and with this minimizer
the bound on the first $t$ examples becomes $P_1 - P_{t+1}$.

Alternatively, the update of the algorithm can be derived in an on-line fashion
as $v_{t+1} = \mathrm{argmin}_v U_t(v)$ where

$$U_t(v) = c\, d_{\mathrm{re}}(v, v_t) + v \cdot L_t$$

and $v$ is constrained to be a probability vector. Again, substituting the
minimizing argument into $U_t$ gives a potential difference, namely

$$P_t - P_{t+1} = U_t(v_{t+1}) \le U_t(v_t) = v_t \cdot L_t .$$

Note that the above upper bound for $P_t - P_{t+1}$ is complemented by the lower
bound (10) that is central to the relative loss bounds proven for the expert
setting.
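
The minimization view of the update can also be probed numerically. In this sketch (random data, seed, and helper names are assumptions) the loss update is compared against randomly drawn probability vectors; sampling cannot prove optimality, it only fails to find a counterexample.

```python
import math
import random

def d_re(u, v):
    return sum(ui * math.log(ui / vi) for ui, vi in zip(u, v) if ui > 0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(3)

def rand_prob(n):
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

n, c = 5, 2.0                                    # illustrative values (assumption)
v_t = rand_prob(n)
L_t = [rng.random() for _ in range(n)]

def U(v):
    """U_t(v) = c d_re(v, v_t) + v . L_t"""
    return c * d_re(v, v_t) + dot(v, L_t)

unnorm = [v * math.exp(-l / c) for v, l in zip(v_t, L_t)]
norm = sum(unnorm)
v_next = [w / norm for w in unnorm]              # loss update
potential_drop = -c * math.log(norm)             # P_t - P_{t+1}

best_random = min(U(rand_prob(n)) for _ in range(5000))
```

The value `U(v_next)` coincides with the potential drop, lies below `U(v_t)`, and is not beaten by any of the sampled vectors.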

If we want to compare the loss of the algorithm to $L(y_t, u \cdot x_t)$ instead of
$u \cdot L_t$, a better update might result from $v_{t+1} = \mathrm{argmin}_v \tilde{U}_t(v)$ where

$$\tilde{U}_t(v) = c\, d_{\mathrm{re}}(v, v_t) + L(y_t, v \cdot x_t)$$

and again $v$ is constrained to be a probability vector. If the loss function is
convex then $L(y_t, v \cdot x_t) \le v \cdot L_t$ and $U_t(v)$ bounds $\tilde{U}_t(v)$ from above. The
bounds that can be obtained for algorithms based on minimizing $\tilde{U}_t$
[KW97, HKW95] differ significantly from the style of bounds we have here. When
the loss $L(y_t, \hat{y}_t)$ of the algorithm is compared to $L(y_t, u \cdot x_t)$, it is
usually impossible to bound the additional loss by a constant (such as
$\tilde{c}_L \ln n$ here). However, bounds where the comparison is to $L(y_t, u \cdot x_t)$ are
in some sense much stronger than the expert style bounds of this paper.

6 Multi-dimensional predictions 

We now consider briefly the case of multi-dimensional predictions. In other
words, instead of having real numbers as outcomes $y_t$, experts' predictions
$x_{t,i}$, and predictions $\hat{y}_t$, we now have vectors from (some subset of)
$\mathbf{R}^k$, for some $k \ge 1$. For instance, the experts' predictions and the
outcomes might be from the $k$-dimensional unit ball
$\{ x \in \mathbf{R}^k \mid \|x\|_2 \le 1 \}$. Since the prediction of each individual expert
at a given time $t$ is a $k$-dimensional vector, all the expert predictions at
time $t$ constitute a $k \times n$ matrix $X_t$. The prediction of the algorithm will
still be a weighted average (i.e., convex combination) of the experts'
predictions: $\hat{y}_t = X_t v_t$ where the weight vector $v_t$ is maintained by
multiplicative updates as before. A loss function is now defined on
$\mathbf{R}^k \times \mathbf{R}^k$; a simple example would be $L(y, \hat{y}) = \|y - \hat{y}\|_2^2$.

Consider now the proof of our main result Theorem 2. The only place where we
use the fact that the values $\hat{y}_t$ and $x_{t,i}$ are real numbers is in proving
that the function $f_y$ defined by $f_y(x) = \exp(-L(y, x)/c)$ is concave for all
$y$. We do this proof by considering the sign of the second derivative of $f_y$.

In the multi-dimensional case, we analogously need to prove that the function
$f_y$ defined by $f_y(x) = \exp(-L(y, x)/c)$ is concave. If we find a value for $c$
such that this holds, then the rest of the proof goes as before and we again
obtain the familiar bound
$\mathrm{Loss}_{\mathrm{WAA}}(S) \le (\min_i \mathrm{Loss}_{\mathcal{E}_i}(S)) + c \ln n$. Alternatively we can use
the relative entropy as in Sect. 5 and obtain the bound
$\mathrm{Loss}_{\mathrm{WAA}}(S) \le \mathrm{Loss}^{\mathrm{avg}}_u(S) + c\, d_{\mathrm{re}}(u, v_1)$ for any probability vector
$u$.

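Consider now when $f_y$ is concave. Let us denote the gradient and Hessian of
$f_y$ by $\nabla f_y$ and $D^2 f_y$, respectively. We need to find out when $D^2 f_y$ is
negative semidefinite everywhere. We have

$$(\nabla f_y(x))_i = \frac{\partial f_y(x)}{\partial x_i} = -\frac{1}{c} \frac{\partial L(y,x)}{\partial x_i} f_y(x)$$

and

$$(D^2 f_y(x))_{ij} = \frac{\partial^2 f_y(x)}{\partial x_i \partial x_j} = \frac{1}{c} f_y(x) \Bigl( \frac{1}{c} \frac{\partial L(y,x)}{\partial x_i} \frac{\partial L(y,x)}{\partial x_j} - \frac{\partial^2 L(y,x)}{\partial x_i \partial x_j} \Bigr) .$$

For $z \in \mathbf{R}^k$ we now have $z^T D^2 f_y(x)\, z \le 0$ if and only if

$$(z \cdot \nabla L_y(x))^2 / c - z^T D^2 L_y(x)\, z \le 0 . \qquad (17)$$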

Note that in order to have this hold for all $z$ we at least need to have
$z^T D^2 L_y(x)\, z$ positive, i.e., the loss $L(y, x)$ needs to be convex in $x$. In
this case we get for $c$ the condition

$$c \ge \sup_{y,x,z} \frac{(z \cdot \nabla L_y(x))^2}{z^T D^2 L_y(x)\, z}$$

where $y$ and $x$ in the supremum range over the possible values of outcomes and
(single) experts' predictions, respectively, and $z$ ranges over $\mathbf{R}^k$.
Comparing this with the constant $\tilde{c}_L$ defined in (8), we see that the first
and second derivatives there are here in some sense replaced with first and
second derivatives in some direction $z$, where the direction $z$ is chosen as
the worst case.

As a first example, consider the square loss $L(y, x) = \|y - x\|_2^2$. Then
$\nabla L_y(x) = 2(x - y)$, and $D^2 L_y(x) = 2I$ where $I$ is the identity matrix.
Hence, we get

$$\frac{(z \cdot \nabla L_y(x))^2}{z^T D^2 L_y(x)\, z} = \frac{(z \cdot (2x - 2y))^2}{2 \|z\|_2^2} ,$$

and this expression obtains its maximum value $2\|x - y\|_2^2$ when $z$ is parallel
to $x - y$. Hence, if the outcomes $y_t$ and the experts' predictions $x_{t,i}$ are
from a ball of radius $R$, so $\|x - y\|_2^2 \le 4R^2$, we can take $c = 8R^2$, which gets
us the bound

$$\mathrm{Loss}_{\mathrm{WAA}}(S) \le \mathrm{Loss}^{\mathrm{avg}}_u(S) + 8R^2 \ln n$$

for any $u$.
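
The worst-case direction for the square loss can be illustrated with random directions. This is a sketch only: the dimension `k = 4`, the sampling of points and directions, and the helper names are assumptions chosen for the illustration.

```python
import random

rng = random.Random(2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def ratio(z, x, y):
    """(z . grad L_y(x))^2 / (z^T D^2 L_y(x) z) for L(y, x) = ||y - x||_2^2."""
    grad = [2.0 * (xi - yi) for xi, yi in zip(x, y)]
    return dot(z, grad) ** 2 / (2.0 * dot(z, z))

k = 4                                                   # illustrative dimension
x = [rng.uniform(-1.0, 1.0) for _ in range(k)]
y = [rng.uniform(-1.0, 1.0) for _ in range(k)]
diff = [xi - yi for xi, yi in zip(x, y)]

upper = 2.0 * dot(diff, diff)                           # claimed maximum 2 ||x - y||^2
max_random = max(
    ratio([rng.gauss(0.0, 1.0) for _ in range(k)], x, y) for _ in range(5000)
)
at_parallel = ratio(diff, x, y)                         # direction parallel to x - y
```

No sampled direction exceeds `upper`, and the direction parallel to `x - y` attains it, as the Cauchy-Schwarz inequality predicts.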
Since the square loss in the multi-dimensional case is simply the sum of square
losses on individual components, we could try handling the $k$-dimensional case
simply by running $k$ copies of the Weighted Average Algorithm and predicting
each component independently of each other. Let us denote the resulting
algorithm by WAA(k) and compare this approach to the one analyzed above. It is
easy to see that if we allow the experts' predictions and outcomes in the
one-dimensional case to range over $[-B, B]$ instead of $[0, 1]$, we must for
square loss replace the constant $\tilde{c}_L = 2$ by $\tilde{c}_L (2B)^2 = 8B^2$. The bound we
get is then

$$\mathrm{Loss}_{\mathrm{WAA}(k)}(S) \le \sum_{j=1}^{k} \Bigl( \min_i \mathrm{Loss}_{\mathcal{E}_i, j}(S) \Bigr) + 8kB^2 \ln n ,$$

where $\mathrm{Loss}_{\mathcal{E}_i, j}(S)$ denotes the loss of expert $\mathcal{E}_i$ on the $j$th component.

Comparing this with the bound we have for the true multi-dimensional Weighted
Average Algorithm (WAA), we see that the first term in the bound for WAA(k)
can be much lower if there are experts that are good for predicting some but
not all of the components. This potential for better fit is what WAA(k) gains
by having $kn$ instead of $n$ weights. On the other hand, the second term in the
bound for WAA(k) is linear in $k$, which is where WAA(k) loses for having so
many weights. (Of course, depending on how the vectors $y_t$ and $x_{t,i}$ are
located in $\mathbf{R}^k$, the factor $8R^2$ in the bound for the true multi-dimensional
WAA may also grow linearly in $k$.)

As another example, consider the relative entropy loss
$L(y, x) = \sum_{j=1}^k y_j \ln(y_j/x_j)$, where we assume that $y$ and $x$ are in the
probability simplex: $y_i, x_i \ge 0$ and $\sum_j x_j = \sum_j y_j = 1$. Then

$$\frac{\partial L_y(x)}{\partial x_i} = -\frac{y_i}{x_i}$$

and

$$\frac{\partial^2 L_y(x)}{\partial x_i \partial x_j} = \delta_{ij} \frac{y_i}{x_i^2} ,$$

where $\delta_{ij} = 1$ for $i = j$ and $\delta_{ij} = 0$ otherwise. Now, given $y$, $x$ and a
vector $z \in \mathbf{R}^k$, let $Q$ be a random variable that for $i = 1, \ldots, k$ takes the
value $q_i = z_i/x_i$ with probability $y_i$. We can then write

$$z \cdot \nabla L_y(x) = -\sum_{i=1}^k z_i \frac{y_i}{x_i} = -\sum_{i=1}^k y_i q_i = -E[Q] ,$$

and similarly

$$z^T D^2 L_y(x)\, z = \sum_{i=1}^k z_i^2 \frac{y_i}{x_i^2} = \sum_{i=1}^k y_i q_i^2 = E[Q^2] .$$

Thus, we have

$$\frac{(z \cdot \nabla L_y(x))^2}{z^T D^2 L_y(x)\, z} = \frac{E[Q]^2}{E[Q^2]} \le 1$$

by the usual properties of random variables. Hence, for relative entropy loss we
have

$$\mathrm{Loss}_{\mathrm{WAA}}(S) \le \Bigl(\min_i \mathrm{Loss}_{\mathcal{E}_i}(S)\Bigr) + \ln n$$

even in the multi-dimensional case.
Abstract. We consider the self-directed learning model 0 which is a
variant of Littlestone’s mistake-bound model UIIUI . We will refute the 
conjecture of that for intersection-closed concept classes, the self- 
directed learning complexity is related to the VC-dimension. We show 
that, even under the assumption of intersection-closedness, both param- 
eters are completely incomparable. 

We furthermore investigate the structure of intersection-closed concept 
classes which are difficult to learn in the self-directed learning model. We 
show that such classes must contain maximum classes. 

We consider the teacher- directed learning model in the worst, best and 
average case performance. While the teaching complexity in the worst 
case is incomparable to the VC-dimension, large concept classes (e.g. 
balls) are bounded by VC-dimension with respect to the average case. 
We show that the teaching complexity in the best case is bounded by 
the self-directed learning complexity. It is also bounded by the VC- 
dimension, if the concept class is intersection-closed. This does not hold 
for arbitrary concept classes. We find examples which substantiate this 
gap. 



1 Introduction 



The self-directed learning model was first introduced by Goldman et al. 0 as 
a variant of the mistake-bound model of Littlestone m and has attracted 
much attention in the meantime. It was successively studied by |8fil2ll2lfill| . In 
this model, the learner makes predictions on adaptively chosen instances in order 
to gain knowledge of the concept to be learnt and is charged with a mistake, if 
his prediction was wrong. 



Goldman and Sloan jH] extensively studied this model with respect to different 
families of concept classes and investigated the self-directed learning complexity 
in relation to the VC-dimension. They conjectured that for any concept class,
the self-directed learning complexity is bounded by its VC-dimension up to a 
constant factor, and that the upper bound is exactly the VC-dimension, if one 
requires the concept class to be intersection-closed. 

Ben-David et al. 0 refuted both conjectures showing that the self-directed learn- 
ing complexity is completely incomparable for arbitrary concept classes. They 
additionally introduced a family of intersection-closed classes with VC-dimension 
2d and self-directed learning complexity 3d. They conjectured that 3/2 is a gen- 
eral maximum ratio of these two parameters for intersection-closed classes. 

We will show that, even for intersection-closed concept classes, the self-directed 
learning complexity is completely incomparable to the VC-dimension introducing 
a family C^, d > 2,m > 1, with VC-dimension d and self-directed learning 
complexity at least m. 

This implies that the minimum size of an unlabeled compression scheme is also 
not bounded by the VC-dimension, even if the concept class is intersection- 
closed. Up to now, this was only proved for arbitrary concept classes. 

We show that intersection-closed concept classes of VC-dimension 2 with high 
self-directed learning complexity have a special structure. They contain maxi- 
mum classes of the same size as the complexity. 

The teacher- directed learning model introduced by Goldman and Kearns jS| fo- 
cusses on the minimum number of labeled instances a helpful teacher has to 
present such that any consistent learner is able to identify the target concept. 
They consider the corresponding teaching complexity in the worst case scenario, 
i.e. the minimum number of labeled examples the teacher has to present such 
that the learner identifies the ’most difficult’ concept. The teaching complexity 
in the worst case is also completely incomparable with the VC-dimension, even 
if the concept class is intersection-closed. 

Here, we consider this model in the average and best case scenario. In the
average case, we make a complete analysis for concept classes of VC-dimension
1. Furthermore, we prove that for concept classes $B^d(c)$ (balls of center $c$
and size $\le d$), the teaching complexity in the average case is bounded by $2d$,
i.e. $2\,\mathrm{VCD}(B^d(c))$.

The best case model is obviously stronger than the other teacher-directed mod- 
els, but we show that it is also stronger than the self-directed model. The teach- 
ing complexity is bounded by the VC-dimension, if the concept class admits 
most specific concepts (which is a superfamily of intersection-closed classes). 
We present examples substantiating that this is not true for arbitrary concept 
classes. 

2 Basic Definitions and Learning Models 

Let $X$ be an instance space of $n$ instances and $C \subseteq 2^X$. We call $C$ a concept
class, its elements $c \in C$ concepts. If it is convenient, we regard a concept
$c$ as a mapping from $X$ to $\{0, 1\}$ where $c(x) = 1$, iff $x \in c$.
Let $z$ denote a subset of $X$ which we call a sequence of $X$. We define
$\Pi_C(z) = \{c \cap z : c \in C\}$ to be the set of intersections of $z$ with all
concepts of $C$. Furthermore, we say that $z$ is shattered by $C$, if
$\Pi_C(z) = 2^z$. Let $\Pi_C(d) = \max_{|z| = d} |\Pi_C(z)|$. Then we define the
Vapnik-Chervonenkis Dimension as the maximum size of a sequence of $X$
shattered by $C$, i.e.

$$\mathrm{VCD}(C) = \max\{d : \Pi_C(d) = 2^d\} .$$

Let $\Phi_d(n) = \sum_{i=0}^{d} \binom{n}{i}$. Then Sauer's Lemma [13] implies that the
cardinality of a concept class $C$ with VC-dimension $d$ is bounded by
$\Phi_d(n)$, i.e. $|C| \le \Phi_d(n)$.
Let $z$ be a sequence of $X$ of size $|z| = m$. Let $f$ be a mapping from subset $z$
into $\{0, 1\}$. Then we call $z|f = \{(x, f(x)) : x \in z\}$ a labeled sequence (of
size $m$), which consists of a set of $m$ labeled instances. If $f$ is constant,
i.e. $f \equiv 1$ ($f \equiv 0$, resp.), then we simply write $z|1$ ($z|0$, resp.). If $z$
is labeled according to a concept $c$, then we write $z|c$. A concept $c$ is
called consistent with $z|f$, iff $z|f = z|c$. According to this, we call
$C_{z|f} = \{c \in C : c \text{ is consistent with } z|f\}$ the version space of $C$ with
respect to $z|f$. Now we define $S(C)$ and $G(C)$, respectively, to be the set of
the concepts of minimum and maximum size, respectively, in $C$. We call a
concept most specific (most general, resp.) with respect to $z|f$, iff
$c \in S(C_{z|f})$ ($c \in G(C_{z|f})$, resp.).
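
For small finite classes the definitions of shattering and VC-dimension can be evaluated by brute force. The following is a minimal sketch; the function names and the two example classes (singletons plus the empty set, and intervals on four points) are illustrative assumptions, and the exhaustive search is only practical for very small instance spaces.

```python
from itertools import combinations
from math import comb

def projections(concepts, z):
    """The set of intersections of z with all concepts of the class."""
    zs = frozenset(z)
    return {frozenset(c) & zs for c in concepts}

def is_shattered(concepts, z):
    return len(projections(concepts, z)) == 2 ** len(z)

def vc_dimension(concepts, X):
    """Largest size of a shattered subset of X (0 if no single point is shattered)."""
    d = 0
    for size in range(1, len(X) + 1):
        if any(is_shattered(concepts, z) for z in combinations(X, size)):
            d = size
        else:
            break  # subsets of shattered sets are shattered, so no larger set can be
    return d

def sauer_bound(n, d):
    """Phi_d(n) = sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

X = [1, 2, 3, 4]
singletons_and_empty = [set(), {1}, {2}, {3}, {4}]
intervals = [set(range(a, b)) for a in range(1, 6) for b in range(a, 6)]
distinct_intervals = {frozenset(c) for c in intervals}
```

On four points the interval class has VC-dimension 2 and exactly `sauer_bound(4, 2)` distinct concepts, so it meets Sauer's bound with equality.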

We say that $C$ admits a unique most specific concept, formally written as
$C \in \mathrm{MSC}$, iff for every labeled sequence $z|f$, $|S(C_{z|f})| \le 1$. We call $C$
intersection-closed, denoted by $C \in \mathrm{IC}$, iff for all $c_1, c_2 \in C$, also
$c_1 \cap c_2 \in C$.

Let $X = \{x_1, \ldots, x_n\}$ and $C = \{c_1, \ldots, c_m\}$. Then the incidence matrix of
$C$ is defined as the binary $n \times m$-matrix, where entry $a_{ij} = 1$, iff
$x_i \in c_j$. Hence, the columns of the incidence matrix represent the concepts
of $C$. For the sake of simplicity, we will consider $C$ as its incidence
matrix, if it is convenient.

Let $C \in \mathrm{MSC}$. Then we define $\mathcal{I}(c)$ as the set containing all subsets $z$
of minimum size with $c$ as their most specific concept, i.e.
$\mathcal{I}(c) = \{z : S(C_{z|1}) = \{c\}\}$. These minimum sets are called the spanning
sets of the concept $c$. $\mathcal{I}(C) = \max_c |\mathcal{I}(c)|$ denotes the maximum number of
spanning sets over all concepts in $C$.

Let $\mathrm{Ind}(C, x) = |\{c \in C : x \in c\}|$ be the index number of $x$ and
$\mathrm{Ind}(C) = \min_{x \in X} \mathrm{Ind}(C, x)$ the index number of $C$. Finally, we define
the inverse concept class $\bar{C} = \{\bar{c} : \bar{c} = X - c, c \in C\}$.

Self-directed model  A self-directed learning algorithm $L$ for a concept class
$C$ selects labeled instances and presents them to the environment as
predictions. After each prediction, the environment reveals the true labeling
of the selected instance with respect to the target concept.

The choice of the labeled instances is adaptive, i.e. it may depend on the
classification of the previously seen instances.

Let $M_{\mathrm{sd}}(C, L, c_t)$ denote the number of wrong predictions (mistakes) $L$
makes to uniquely identify the target concept $c_t$. Then the self-directed
learning complexity is defined as

$$M_{\mathrm{sd}}(C) = \min_L \max_{c_t \in C} M_{\mathrm{sd}}(C, L, c_t) .$$
Teacher-directed model We define the teaching complexity for C with respect 
to the target concept Ct, denoted by Ct), as the minimum size of a sequence 

z which uniquely identifies a target concept ct G C, i.e. \Cz\ct \ = 1- 
This parameter can be regarded as the shortest size of a sequence a helpful 
teacher has to present as a labeled sequence z|ct such that any consistent learner 
uniquely identifies the target concept Ct. 

We call a sequence z of minimum size which uniquely identifies a target concept 
Ct GC minimum sequence for Ct- 

We define the teaching complexity of $C$, distinguishing between the best, the
worst and the average case performance:

$$M_{td\text{-}worst}(C) = \max_{c_t \in C} M_{td}(C, c_t), \qquad M_{td\text{-}best}(C) = \min_{c_t \in C} M_{td}(C, c_t).$$

In order to define the average case, let $P$ be a distribution over $C$. Then the
average case performance of the teaching complexity w.r.t. the
distribution $P$ is

$$M_{td\text{-}average}(C, P) = \sum_{c \in C} P(c)\, M_{td}(C, c).$$
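The three variants of the teaching complexity can be computed by exhaustive search on small classes. The sketch below is only an illustration: it assumes concepts are frozensets of instances, uses the uniform distribution when no distribution is passed, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def teaching_complexity(concepts, instances, target):
    """M_td(C, c_t): size of a smallest z whose labels under c_t fit no other concept."""
    others = [c for c in concepts if c != target]
    for size in range(len(instances) + 1):
        for z in combinations(instances, size):
            # every other concept must disagree with the target somewhere on z
            if all(any((x in c) != (x in target) for x in z) for c in others):
                return size
    raise ValueError("target cannot be distinguished on the given instances")


def teaching_summary(concepts, instances, dist=None):
    """Worst, best and average teaching complexity (uniform P unless dist is given)."""
    concepts = list(concepts)
    if dist is None:
        dist = {c: Fraction(1, len(concepts)) for c in concepts}
    per_concept = {c: teaching_complexity(concepts, instances, c) for c in concepts}
    return {
        "worst": max(per_concept.values()),
        "best": min(per_concept.values()),
        "average": sum(dist[c] * per_concept[c] for c in concepts),
    }


# Hypothetical toy class: the three singletons together with the empty set.
X = [1, 2, 3]
C = [frozenset(), frozenset({1}), frozenset({2}), frozenset({3})]
summary = teaching_summary(C, X)
```

Here the empty concept needs all three instances while each singleton needs one, which gives worst case 3, best case 1 and uniform average 3/2.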

The teaching complexity in the worst case performance is identical to the teach- 
ing dimension introduced by Goldman et al. 

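It is known that $M_{td\text{-}worst}$ can be arbitrarily high if the VC-dimension is
constant, even under the assumption of intersection-closedness [?] (it is easy to
see that, e.g., the class of singletons together with $\emptyset$ is intersection-closed with
VC-dimension 1, but its worst-case teaching complexity equals the number of instances).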

In the sequel, we will deal with the other two models, $M_{td\text{-}average}$ and $M_{td\text{-}best}$,
in order to investigate their behaviour with respect to arbitrary and
intersection-closed concept classes. Obviously, the hierarchical structure of the
learning models is

$$M_{td\text{-}best}(C) \le M_{td\text{-}average}(C, P) \le M_{td\text{-}worst}(C).$$

But it also holds that

$$M_{td\text{-}best}(C) \le M_{sd}(C).$$

We will prove this in Section 7 (Lemma 6).

Hence, $M_{td\text{-}best}$ stands hierarchically above all other models which we consider
in this context.

3 Preliminary Results

A basic tool to apply the property of intersection-closedness is that there exist
$d + 1$ concepts which witness that the VC-dimension is at least $d$. We can express
this observation in terms of the occurrence of a submatrix^ $D_d = (D_{ij})$ in
the incidence matrix of the concept class, where $D_{ij} = 0$ iff $i = j$. In other words,
$D_d$ is a $d \times (d + 1)$-matrix which has a diagonal with 0, and all other entries are 1.
Now we have the following

Lemma 1. Let $C$ be an intersection-closed concept class. $C$ contains a submatrix
$D_d$ if and only if $VCD(C)$ is at least $d$.

Proof. Assume the existence of $D_d$. Let $x_1, \dots, x_d$ be the instances (rows) and
$c_0, \dots, c_d$ the concepts (columns) of $D_d$, where $c_i$ is the column with its 0 in
row $x_i$. Then, for every labeling $f$ of the instances, the intersection
$c_\cap = \bigcap_{f(x_i) = 0} c_i$ is in $C$ due to the intersection-closedness,
and $x_1, \dots, x_d|f = x_1, \dots, x_d|c_\cap$. This implies that $x_1, \dots, x_d$ is shattered by $C$.
The other direction is evident because the VC-dimension implies the existence
of $d$ instances on which every labeling occurs. Hence, $C$ also contains $D_d$ on the
$d$ instances. ■
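The first direction of the proof can be checked mechanically for small $d$. The Python sketch below is an illustration only: it assumes the columns of $D_d$ are stored as sets of row indices, closes them under intersection and tests whether the rows are shattered. All names are hypothetical.

```python
from itertools import combinations


def d_matrix_concepts(d):
    """Columns of D_d as sets of row indices 1..d: column 0 is all ones,
    column j (1 <= j <= d) has its single 0 in row j."""
    rows = frozenset(range(1, d + 1))
    return [rows] + [rows - {j} for j in range(1, d + 1)]


def intersection_closure(concepts):
    """Smallest intersection-closed family containing the given concepts."""
    closed = set(concepts)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(closed), 2):
            if a & b not in closed:
                closed.add(a & b)
                changed = True
    return closed


def is_shattered(concepts, instances):
    """True iff every labeling of the instances is realised by some concept."""
    s = frozenset(instances)
    return len({c & s for c in concepts}) == 2 ** len(s)


d = 3
columns = d_matrix_concepts(d)
closure = intersection_closure(columns)
```

The $d + 1$ columns alone realise only $d + 1$ labelings; it is the closure under intersection that yields all $2^d$ of them.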

Now we briefly investigate the learning properties of concept classes of VC- 
dimension 1. First, the following structural result holds: 

Lemma 2. Let $C$ be an arbitrary concept class with VC-dimension 1. Then
either $Ind(C) \le 1$ or $Ind(\bar{C}) \le 1$.

Proof (Sketch). Consider a concept $c_0 \in C$. With respect to this concept we define
a new concept class $C[c_0] = \{c' : c' = c \,\triangle\, c_0,\ c \in C\}$. In other words, we invert those
entries of rows of the incidence matrix of $C$ for which $c_0(x) = 1$. This clearly
has no influence on the VC-dimension and, obviously, $C[c_0]$ contains the empty
set $c_0 \,\triangle\, c_0$. We will find an $x$ such that $Ind(C[c_0], x) \le 1$. Then, either $x$ is
only contained in $c_0$ ($Ind(C) \le 1$) or $x$ is contained in every other concept but
$c_0$ ($Ind(\bar{C}) \le 1$). This proves the lemma.

In order to show $Ind(C[c_0]) \le 1$, we assume the contrary, i.e. $Ind(C[c_0]) \ge 2$. It
can be shown by induction that for an arbitrarily large $k$ we can find a sequence
of different instances and corresponding concepts $(x_1, c_1), \dots, (x_k, c_k) \in (X \times C[c_0])^k$
with the property that for all $i$ and all $j > i$, $x_1, \dots, x_i \in c_i$ and $x_j \notin c_i$. This is
a contradiction, since $X$ is finite. ■

Lemma 3. Let $C$ be an intersection-closed concept class with VC-dimension 1.
Then $Ind(C) \le 1$.

Proof. The proof is similar to the one of the previous lemma, but this time the
construction of $C[c_0]$ is not necessary. It is a consequence of the
intersection-closedness that, with the assumption $Ind(C) \ge 2$, the empty set is in $C$. ■

^ A submatrix of a matrix A is obtained by deletion and permutation of arbitrary rows
and columns of A.

Finally we want to mention that Lemma 2 enables us to prove a result of Goldman
and Sloan [?] in an alternative, compact way:

Theorem 1. Let $C$ be a concept class with $VCD(C) = 1$. Then $M_{sd}(C) = 1$.

Proof. We introduce a learning strategy which relies on the previous lemma. 
The self-directed learning strategy goes the ’safest’ way and chooses the instance 
which is covered by one or by all but one concepts of the version space, respec- 
tively. Its prediction is the label of the majority. Hence, a mistake identifies 
uniquely the target concept. ■ 
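The strategy in this proof can be simulated directly. The following Python sketch is a hedged illustration: concepts are assumed to be frozensets of instances, ties in the majority vote are resolved by predicting label 0, and the function name and toy class are hypothetical.

```python
def learn_vcd1(concepts, instances, target):
    """Run the 'safest' strategy against a fixed target; return the number of mistakes."""
    version_space = set(concepts)
    mistakes = 0
    while len(version_space) > 1:
        size = len(version_space)
        choice = None
        for x in instances:
            count = sum(1 for c in version_space if x in c)
            # instance covered by exactly one, or by all but one, remaining concept
            if count == 1 or count == size - 1:
                choice = (x, count)
                break
        if choice is None:
            raise ValueError("no suitable instance; is the VC-dimension really 1?")
        x, count = choice
        prediction = count > size / 2     # label of the majority (tie: predict 0)
        truth = x in target
        if prediction != truth:
            mistakes += 1                 # a mistake leaves a single consistent concept
        version_space = {c for c in version_space if (x in c) == truth}
    return mistakes


# Hypothetical toy class of VC-dimension 1: a chain of nested concepts.
X = [1, 2, 3]
chain = [frozenset(), frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3})]
mistake_counts = [learn_vcd1(chain, X, target) for target in chain]
```

Whatever the target, the simulated learner never makes more than one mistake on this class, in line with the theorem.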



4 Self-directed learning of intersection-closed concept 
classes and the VC-Dimension 



[Figure 1: block layout of the extended incidence matrix, composed of the parts $E_1(Q)$, $E_2(Q)$ and $E_3(Q)$]

Fig. 1. The invariant extension of an intersection-closed incidence matrix Q



Repeatedly, it was conjectured that the self-directed learning complexity is bounded
by the VC-dimension (up to a constant factor) if the concept class to be learnt
is intersection-closed. Goldman and Sloan [?] supposed that $M_{sd}$ is bounded by
the VC-dimension for intersection-closed concept classes. Ben-David et al. [?]
disproved this by a family of examples where the ratio of the two values is 3/2.
They conjectured that this ratio holds as an upper bound for all
intersection-closed concept classes.

In this section, we will refute this conjecture, showing that there exist
intersection-closed concept classes of constant VC-dimension and arbitrarily large
self-directed learning complexity.

Theorem 2. For each $m \ge 1$ and each $d \ge 2$, there exists a concept class $C_m^d$
with the following properties: $C_m^d \in IC$, $VCD(C_m^d) = d$, $M_{sd}(C_m^d) \ge m$.

Proof (Sketch). The basic idea of the proof is to extend a concept class recursively
according to a rule which leaves the VC-dimension and the property of
intersection-closedness invariant. Then we show that if the concept class is
extended $m$ times, then for each learner the adversary is able to keep at least the
$(m - 1)$ times extended concept class in the version space.

For the extension rule we consider the incidence matrix of the concept classes.
Figure 1 illustrates the incidence matrix of the extended concept class.
Formally, let $Q_{nr}$ be an $(n \times r)$-matrix. We furthermore define $A^i_{nr}$ as the
$(n \times r)$-matrix where all rows are 0 except the $i$th row, which is 1.

Then we consider the $((n + 1)n \times (n + 1)r)$-matrix $E_1(Q_{nr})$ which contains $n + 1$
blocks of $n$ rows and $n + 1$ blocks of $r$ columns. We denote the row blocks as
$I^1, \dots, I^{n+1}$ and the column blocks as $I_1, \dots, I_{n+1}$, where $I^{i,k}$ denotes the $k$th instance
(row) in the $i$th row block and $I_{j,l}$ denotes the $l$th concept (column) in the $j$th
column block, respectively. We further denote the block of row block $I^i$ and
column block $I_j$ as $I^i_j$ and the label of the $l$th instance with respect to the $k$th
concept in this block as $I^{i,l}_{j,k}$. $E_1(Q_{nr})$ contains $n + 1$ copies of $Q_{nr}$ in its main
diagonal $I^i_i$. The other blocks contain the matrices $A^k_{nr}$. More precisely, for all
$i < j$ the block $I^i_j$ contains $A^{j-1}_{nr}$, and for all $i > j$ the block $I^i_j$ contains $A^{j}_{nr}$.
Furthermore we define the matrix $E_2(Q_{nr})$ as an $((n + 1)n \times \frac{1}{2}n(n + 1))$-matrix of
columns where, for each two different blocks $I_i, I_j$, $i < j$, of $E_1(Q_{nr})$, there is a
concept containing exactly two instances in these blocks, at rows $I^{i,j-1}$ and $I^{j,i}$.
Now $E_3(Q_{nr})$ is the $((n + 1)n \times ((n + 1)n + 1))$-matrix which is the identity matrix
with an additional column of zeros.

Finally, we obtain $E(Q_{nr}) = E_1(Q_{nr}) \,|\, E_2(Q_{nr}) \,|\, E_3(Q_{nr})$ (where $|$ is the matrix
concatenation), which represents the incidence matrix of the extended concept
class we want to consider.
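The extension rule can be written down as a short program. The Python sketch below is one possible reading of the construction and rests on assumptions: rows are addressed as pairs (row block, position), columns are stored as sets of such pairs, and the position of the single 1 in an off-diagonal block follows the function $\mu$ used in the appendix. The names `mu`, `extend` and `is_intersection_closed` are hypothetical.

```python
def mu(i, j):
    """Position of the row labeled 1 in the off-diagonal block (row block i, column block j)."""
    return j if i > j else j - 1


def extend(Q):
    """Return (rows, columns) of E(Q) = E_1(Q) | E_2(Q) | E_3(Q) for a 0/1 matrix Q."""
    n, r = len(Q), len(Q[0])
    blocks = range(1, n + 2)
    rows = [(i, k) for i in blocks for k in range(1, n + 1)]
    columns = []
    # E_1: copies of Q on the diagonal, a single row of ones in every other block
    for j in blocks:
        for l in range(r):
            column = set()
            for (i, k) in rows:
                if i == j:
                    if Q[k - 1][l] == 1:
                        column.add((i, k))
                elif k == mu(i, j):
                    column.add((i, k))
            columns.append(frozenset(column))
    # E_2: one two-element concept for every pair of blocks i < j
    for i in blocks:
        for j in blocks:
            if i < j:
                columns.append(frozenset({(i, mu(i, j)), (j, mu(j, i))}))
    # E_3: identity matrix plus a column of zeros
    columns.extend(frozenset({row}) for row in rows)
    columns.append(frozenset())
    return rows, columns


def is_intersection_closed(columns):
    family = set(columns)
    return all(a & b in family for a in family for b in family)


# Hypothetical base class: the powerset of two instances (rows x1, x2).
Q = [[0, 1, 0, 1],
     [0, 0, 1, 1]]
rows, columns = extend(Q)
```

For this base matrix the extension has 6 rows and 12 + 3 + 7 = 22 columns, and the resulting family of columns is closed under intersection.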

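Claim 1. $Q_{nr} \in IC \Rightarrow E(Q_{nr}) \in IC$.

Claim 2. If $Q_{nr} \in IC$ and $VCD(Q_{nr}) = d \ge 2$, then $VCD(E(Q_{nr})) = d$.

Claim 3. For all labeled instances $(x, l) \in X \times \{0, 1\}$, $Q_{nr} \in E(Q_{nr})_{x|l}$, i.e. $Q_{nr}$
remains a submatrix of the version space.

The claims are proved in the appendix.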

Now we complete the proof of the theorem. Let $M_0^d$ be the incidence matrix of
an arbitrary concept class $C_0^d$ with $C_0^d \in IC$ and $VCD(C_0^d) = d$. We inductively
define $C_m^d$ as the concept class which corresponds to $E(M_{m-1}^d)$. According to
Claims 1 and 2, $C_m^d \in IC$ and $VCD(C_m^d) = d$. Finally, Claim 3 states that the
adversary can force the learner to make a mistake such that the version space still
contains $C_{m-1}^d$. Hence, after $m$ mistakes the version space still contains at least
$C_0^d$. ■

Floyd and Warmuth [?] establish the relationship between self-directed learning
and unlabeled compression schemes. An unlabeled compression scheme can be
interpreted as a self-directed learner who is restricted to choose his predictions
from a particular sequence. Thus, the self-directed learning model is at least as
powerful as the unlabeled compression scheme.

With the above result we can formulate the following

Corollary 1. The minimum size of an unlabeled compression scheme is not
bounded by any function of the VC-dimension of intersection-closed concept
classes.

5 The structure of intersection-closed concept classes 
with high learning complexity 








Fig. 2. The maximum matrix $A_k$ which is included as a submatrix in the
incidence matrix



In the last section we proved that, even under the assumption of intersection- 
closedness, the self-directed learner can be forced to make arbitrarily many mis- 
takes to learn a concept class with constant VC-dimension. 

For these cases, it is of interest to gain knowledge about the structure
of the concept class to be learned. In this section, we investigate the structure
of concept classes of VC-dimension 2 with self-directed learning complexity
$k$ and find that there is a maximum subclass^ contained in the concept
class.

We define the $k \times (\frac{1}{2}k(k + 1) + 1)$-matrix $A_k = (A^1, \dots, A^k, \mathbf{0})$ where $A^i$ is a $k \times i$-matrix
containing the unit matrix of size $i$ in the last $i$ rows. The other entries are
labeled with 1. $\mathbf{0}$ indicates the zero vector. Figure 2 illustrates the form of this
matrix. This matrix represents a maximum class, since it has VC-dimension 2
and the number of labelings on the $k$ instances is of maximum size (Sauer's lemma).
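The shape of $A_k$ and the two properties claimed for it can be checked for small $k$. The Python sketch below assumes, for illustration, that each column is stored as the set of row indices carrying a 1; the function names are hypothetical and the VC-dimension is found by exhaustive search.

```python
from itertools import combinations
from math import comb


def a_matrix_concepts(k):
    """Columns of A_k as sets of row indices 1..k (a row is in the set iff its entry is 1)."""
    concepts = []
    for i in range(1, k + 1):
        top = set(range(1, k - i + 1))            # rows above the unit matrix are all 1
        for j in range(1, i + 1):
            concepts.append(frozenset(top | {k - i + j}))
    concepts.append(frozenset())                   # the zero vector
    return concepts


def vc_dimension(concepts, instances):
    """Largest size of a shattered subset of the instances (exhaustive search)."""
    best = 0
    for size in range(1, len(instances) + 1):
        for subset in combinations(instances, size):
            s = frozenset(subset)
            if len({c & s for c in concepts}) == 2 ** size:
                best = size
                break
        else:
            return best                            # nothing of this size is shattered
    return best


k = 4
A = a_matrix_concepts(k)
sauer_bound = comb(k, 0) + comb(k, 1) + comb(k, 2)
```

For $k = 4$ the matrix has 11 pairwise different columns, which equals the Sauer bound for VC-dimension 2 on four instances.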



^ A maximum subclass is maximum in the sense that its VC-dimension increases if
an arbitrary concept is added.

Theorem 3. Let $C \in IC$ with $VCD(C) = 2$ and let $M_{sd}(C) = k$. Then the
incidence matrix of $C$ contains a submatrix $A_k$.

Proof (Sketch). We follow the strategy of the learner who always chooses the
instances $x_1, x_2, \dots$ with the minimum index number of the version space.
Exactly the concept class restricted to the subsequence on which the learner made
wrong predictions has the structure of $A_k$.

For the detailed proof we refer to the extended version of this paper. ■ 

6 Teacher-directed learning - average case 

In this section, we investigate how the teaching complexity behaves on a
concept class in the average case. Throughout this section, $U$ denotes the uniform
distribution on $C$. We first observe that the teaching complexity of a concept $c_0$
is related to the existence of concepts which are 'close' to $c_0$.

Definition. Let $z|f_0$ be a labeled sequence. Then we define the $k$-ball with
center $z|f_0$, $B_z^k(f_0) = \{z|f : d_z(f_0, f) \le k\}$, where $d_z(f_0, f)$ denotes the Hamming
distance of two labelings with respect to the sequence $z$. If $z = X$ then we simply
write $B^k(c_0)$ for the $k$-ball with center $c_0$. We say that $C$ contains a $k$-ball with
center $z|c_0$ iff $B_z^k(c_0)$ is a subset of $C|z$. Since the VC-dimension of a $k$-ball is $k$, a
concept class containing a $k$-ball has VC-dimension of at least $k$.

Lemma 4. Let $c_t \in C$. For any minimum sequence $z$ for $c_t$, $C$ contains a 1-ball
with center $z|c_t$.

Proof. Let $z$ be a minimum sequence for $c_t$, i.e. $|C_{z|c_t}| = 1$. Then for every $i$,
$|C_{(z - x_i)|c_t}|$ must be at least two, otherwise $z - x_i$ is already a minimum sequence.
Therefore, for all $i$, $|C_{(z - x_i)|c_t;\, x_i|\bar{c}_t}|$ is at least one, i.e. some concept agrees
with $c_t$ on $z - x_i$ and disagrees with it on $x_i$. ■



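Theorem 4. Consider $C = \{c_0, \dots, c_m\}$ with $VCD(C) = 1$. Then there exist
$k_0, \dots, k_m \in \mathbb{N}_0$ with $\sum_{i=0}^{m} k_i \le 2m$ such that

$$M_{td\text{-}average}(C, P) = \sum_{i=0}^{m} k_i P(c_i).$$

In particular, for the uniform distribution $U$, $M_{td\text{-}average}(C, U) < 2$.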

Proof (Sketch). Since every concept class with VC-dimension 1 is embeddable
into a maximum class [?], we can assume that the instance space has size $m$.
We prove the first part of the theorem by induction over the size $m$ of the instance
space. For $m = 1$, the statement holds trivially.

Let us now assume that the statement holds for $m - 1$. Now let the instance space
$X$ consist of $m$ elements. For an instance $x \in X$ define $C|(X - x)$ as the restriction
of $C$ to $X - x$. We call $(c_1, c_2)_x$ a pair of $x$ if $x \notin c_1$ and $c_1 \cup \{x\} = c_2$. We call $c$
independent if it does not belong to any pair. Let $k_c$ be the length of the smallest
sequence identifying $c$ in $C$ (i.e. $k_c = M_{td}(C, c)$), and let $k'_c$ be the length of the smallest
sequence identifying the restriction of $c$ in $C|(X - x)$ (i.e. $k'_c = M_{td}(C|(X - x), c)$).
Then, according to the definition, $M_{td\text{-}average}(C, P) = \sum_{c \in C} k_c P(c)$.

Claim 1. There exists at most one pair of $x$ in $C$.

Claim 2. Let $(c_1, c_2)_x$ be a pair in $C$. Then, up to a permutation, $k_{c_1} = k'_{c_1} + 1$
and $k_{c_2} = 1$.

Claim 3. For all independent concepts $c$, $k_c = k'_c$.

We prove the claims in the extended version of this paper.

Since, with the restriction of $C$ to $X - x$, only $c_2$ of a pair $(c_1, c_2)_x$ is removed,
$C = (C|(X - x)) \cup \{c_2\}$. Hence, if $C$ does not contain a pair, then, according to
Claim 3,

$$\sum_{c \in C} k_c = \sum_{c \in C|(X - x)} k'_c \le 2(m - 1).$$

Otherwise, Claim 1 implies that $C$ contains exactly one pair and, with Claim 2,

$$\sum_{c \in C} k_c = \sum_{c \in C|(X - x)} k'_c + 2 \le 2m.$$

This completes the induction.

For the uniform distribution $U$, we obtain $M_{td\text{-}average}(C, U) \le \frac{2m}{m + 1} < 2$. ■



Lemma 5. For every ball $B^d(c)$, the measures $VCD$, $\dots$ and
$M_{td\text{-}best}$ are independent of the center $c$.

Proof. This is easy to see, since two different balls $B^d(c_1)$ and $B^d(c_2)$ can be
transformed into each other by simply inverting those rows of the incidence
matrix for which $c_1(x) \ne c_2(x)$. This clearly has no influence on any of the four
measures. ■



Theorem 5. Let $B^d(c)$ be a ball with an arbitrary center $c$ and $U$ be the uniform
distribution over $B^d(c)$. Then $M_{td\text{-}average}(B^d(c), U) \le 2d$.
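For small parameters the bound of Theorem 5 can be compared with an exhaustive computation. The Python sketch below is only a sanity check under assumptions: the ball is taken around the empty concept, concepts are frozensets of instances, `phi(d, n)` stands for $\Phi_d(n) = \sum_{i \le d} \binom{n}{i}$, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def ball(instances, d):
    """B^d with center the empty concept: all concepts with at most d instances."""
    return [frozenset(s) for i in range(d + 1) for s in combinations(instances, i)]


def teaching_complexity(concepts, instances, target):
    """Size of a smallest set of instances on which only the target fits its own labels."""
    others = [c for c in concepts if c != target]
    for size in range(len(instances) + 1):
        for z in combinations(instances, size):
            if all(any((x in c) != (x in target) for x in z) for c in others):
                return size
    raise ValueError("target cannot be distinguished on the given instances")


def average_teaching_complexity(concepts, instances):
    """Average teaching complexity under the uniform distribution, as an exact fraction."""
    total = sum(teaching_complexity(concepts, instances, c) for c in concepts)
    return Fraction(total, len(concepts))


def phi(d, n):
    return sum(comb(n, i) for i in range(d + 1))


n, d = 4, 2
X = list(range(1, n + 1))
average = average_teaching_complexity(ball(X, d), X)
closed_form = d + (n - d) * Fraction(phi(d - 1, n), phi(d, n))
```

For $n = 4$ and $d = 2$ the exhaustive average is 32/11, which coincides with the expression $d + (n - d)\,\Phi_{d-1}(n)/\Phi_d(n)$ and stays below $2d = 4$.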

Proof. Due to the previous lemma, it suffices to consider the ball with center $\emptyset$, i.e.
$B^d = \{c : |c| \le d\}$. Partition $B^d$ into $C_0, \dots, C_d$, where $C_i = \{c : |c| = i\}$. Clearly,
for all $i < d$ and all $c \in C_i$, $M_{td}(B^d, c) \le n$. For $c \in C_d$, it holds that $M_{td}(B^d, c) \le d$,
since every such concept is determined by its positive labels. Hence, applying the
easy observation that $\Phi_{d-1}(n) \le \frac{d}{n - d}\Phi_d(n)$,

$$M_{td\text{-}average}(B^d, U) \le \frac{n\,\Phi_{d-1}(n) + d\,(\Phi_d(n) - \Phi_{d-1}(n))}{\Phi_d(n)} = d + (n - d)\frac{\Phi_{d-1}(n)}{\Phi_d(n)} \le 2d.$$ ■

7 Teacher-directed learning - best case

In the previous sections we demonstrated that neither for the self-directed
learning complexity nor for the teaching complexity (worst case) the learnability
of intersection-closed concept classes is bounded by any function of their
VC-dimension. In this section, we investigate $M_{td\text{-}best}$. We will see that, in fact, the
complexity is bounded if $C$ is intersection-closed. This is not true for arbitrary
concept classes, as we will see by an example. We introduce a family of concept
classes $C^d$ with $VCD(C^d) = 2d$ and $M_{td\text{-}best}(C^d) = 3d$.

In Section 2 we asserted that $M_{td\text{-}best}$ is more powerful than any other learning
model considered in this paper. It remains to show that $M_{td\text{-}best}$ is even more
powerful than the self-directed learning model.

Lemma 6. For any concept class $C$, $M_{td\text{-}best}(C) \le M_{sd}(C)$.

Proof. The idea of the proof is to run the 'greedy' adversary, who provokes the
self-directed learner $L_{sd}$ to make mistakes whenever it is possible. Let $z$ be
the sequence $L_{sd}$ presents. Let $\{x_1, \dots, x_k\} \subseteq z$ be the instances on which $L_{sd}$
made a mistake and $\{y_1, \dots, y_l\} \subseteq z$ those for which the adversary was not able
to reveal another label but the one issued by the learner. Then $x_1, \dots, x_k|c_t$
already uniquely identifies the target concept $c_t$.

Consider $C_{z|c_t}$, which contains only the target concept $c_t$. If we remove $y_1$ from the
sequence, then $C_{(z - y_1)|c_t}$ still contains only $c_t$, since all concepts of the version space
provide $y_1$ with the same labeling. Therefore $y_1$ cannot gain more information.
With the same argument, we can successively remove all instances $y_1, \dots, y_l$ from
the sequence $z$ and, hence, $C_{(x_1, \dots, x_k)|c_t}$ only contains the target concept. ■

Theorem 6. For every concept class $C \in MSC$, $M_{td\text{-}best}(C)$ is bounded by
$VCD(C)$.

Proof. We prove that, for $C \in MSC$, $M_{td\text{-}best}(C) \le I(C)$.

Let $c_t \in G(C)$ and take $z_t \in I(c_t)$ which identifies $c_t$ as its most specific concept,
i.e., $S(C_{z_t|c_t}) = \{c_t\}$. Since there is no other concept $c \in C$ containing $c_t$, $C_{z_t|c_t} = \{c_t\}$.
Hence, $M_{td\text{-}best}(C) \le |z_t| \le I(C)$.

Since Natarajan [?] shows that $I(C)$ is bounded by the VC-dimension of $C$, the
theorem is proved. ■

The above theorem does not hold for arbitrary concept classes. The property $C \in
MSC$ cannot be dropped to bound the teaching complexity by the VC-dimension.
In order to show this, we construct a concept class $E(C)$ with $VCD(E(C)) = 2$
and $M_{td\text{-}best}(E(C)) = 3$. Let $C = \{c_1, c_2, c_3, c_4\}$ be the powerset of $X = \{x_1, x_2\}$.
Clearly $VCD(C) = 2$. Now let $E(C)$ be the concept class whose incidence matrix
consists of $8 \times 8$ blocks of size $2 \times 3$. We use the same notation of the blocks,
rows and columns ($I^i$, $I_j$, ...) as in the proof of Theorem 2. Figure 3 illustrates
the extension matrix of $C$. The entries of the diagonal are those three vectors
of $C$ which are not in the other blocks in the same column. More precisely:
$I^i_i = C - c_{\lceil i/2 \rceil}$, and for all $i \ne j$ and $k = 1, \dots, 3$, $I^i_{j,k} = c_{\lceil j/2 \rceil}$. Note that, on
every column block $I_j$, all instances are constant up to those in row block $I^j$.

Fig. 3. The expansion of powerset $C$ to a concept class $E(C)$ with $VCD(E(C)) = 2$
and learning complexity $M_{td\text{-}best}(E(C)) = 3$

Theorem 7. There exists a family of concept classes $(C^d)$ with $VCD(C^d) = 2d$
and $M_{td\text{-}best}(C^d) = 3d$.

Proof. For the proof of $VCD(E(C)) = 2$ and $M_{td\text{-}best}(E(C)) = 3$, we refer to
the extended version of this paper. Given this, the statement is established for
the case $d = 1$. From this we easily obtain the case for arbitrary $d$ by combining $d$
copies of the concept class, $C^d = C^1 \times \dots \times C^1$. ■

8 Conclusion and Open Problems 

In this paper we discussed the self-directed learning complexity and teaching 
complexity, respectively, in relation to intersection-closed concept classes. We 
completely solved the question of the relation between the self-directed learn- 
ing complexity and the VC-dimension under the assumption of intersection- 
closedness. We showed that both parameters are completely incomparable. 
Knowing that the self-directed learning complexity of intersection-closed concept
classes may be arbitrarily large while the VC-dimension is constant, we
investigated the structure of such a concept class. We proved that for concept
classes $C$ with $VCD(C) = 2$ and $M_{sd}(C) = k$, the concept class must contain a
maximum subclass of size $k$. This raises the question:

Open Problem 1. Is it possible to extend the structural result given in Theorem
3 to concept classes with higher VC-dimension?

In Sections 6 and 7 we investigated the teacher-directed complexity in the average
and best case performance. Next to the full analysis of concept classes of
VC-dimension 1, we show that in the average case, the teacher-directed complexity
of the classes $B^d(c)$ is bounded by two times the VC-dimension of the class.

Open Problem 2. Does Theorem 5 hold for any intersection-closed concept
class, i.e. $M_{td\text{-}average}(C, U) \le 2d$ for the uniform distribution $U$ over $C$?

The best case analysis of teacher-directed complexity establishes a gap between
arbitrary and intersection-closed concept classes. For intersection-closed concept
classes, the teacher-directed complexity is bounded by the VC-dimension. On
the other hand we found a concept class for which the teacher-directed
complexity (best case) exceeds the VC-dimension by a ratio of 3/2. This gap is a
strong result since we show that $M_{td\text{-}best}$ is the strongest model considered in
this paper.

Open Problem 3. Are there families of concept classes with constant
VC-dimension, but arbitrarily large $M_{td\text{-}best}$ complexity?



A Appendix 



Proof (Theorem 2).

In order to prove the three claims, consider for $i \ne j$ the bijective function

$$\mu(i, j) = \begin{cases} j & \text{for } i > j, \\ j - 1 & \text{for } i < j. \end{cases}$$

Note that $\mu(i, j)$ indicates the position of the instance labeled by 1 in block $I^i_j$.
Also note that, in matrix $E_2(Q_{nr})$, for different blocks $i, j$, there is a concept
containing exactly two instances in these blocks, at rows $I^{i,\mu(i,j)}$ and $I^{j,\mu(j,i)}$.

Claim 1. $Q_{nr} \in IC \Rightarrow E(Q_{nr}) \in IC$.

We have to show that, for two concepts $c_1, c_2 \in E(Q_{nr})$, also the intersection
belongs to the concept class, i.e. $c_1 \cap c_2 \in E(Q_{nr})$.

Case 1: $c_1, c_2 \in I_i$.

Then $c_1 \cap c_2 \in I_i$. This is clear because the concepts in $I_i$ are constant on $I^k$ if
$k \ne i$, and $Q_{nr}$ is intersection-closed.

Case 2: $c_1 \in I_i$, $c_2 \in I_j$.

Without loss of generality, let $i < j$. The intersection of $c_1$ and $c_2$ is clearly zero
on $I^k$ for $k \ne i, j$. As on $I^i$, $c_2$ contains exactly one instance at position $j - 1$, and,
conversely, on $I^j$, $c_1$ contains exactly one instance at position $i$, $c_1 \cap c_2$ contains
at most two instances, which are at positions $I^{i,j-1}$ and $I^{j,i}$. These concepts are
contained either in $E_2(Q_{nr})$ or in $E_3(Q_{nr})$.

Case 3: $c_2 \in E_2(Q_{nr}) \,|\, E_3(Q_{nr})$.

If $c_1 \cap c_2 \ne c_2$ then $c_1 \cap c_2$ contains at most one instance, which is clearly in
$E_3(Q_{nr})$.

This proves Claim 1.

We see that $E_2(Q_{nr})$ and $E_3(Q_{nr})$ are necessary submatrices in order to assure
the intersection-closedness of the extended concept class.

Claim 2. If $Q_{nr} \in IC$ and $VCD(Q_{nr}) = d \ge 2$, then $VCD(E(Q_{nr})) = d$.

Let $VCD(Q_{nr}) = d$. Assume that $VCD(E(Q_{nr})) \ge d + 1$ and let $x_1, \dots, x_{d+1}$ be
instances shattered by $E(Q_{nr})$. Then not all instances can be in one row block $I^i$,
as otherwise this row block would contain the submatrix $D_{d+1}$, which could only
appear in $I^i_i = Q_{nr}$ because more than one instance is labeled positively. This
would imply that $VCD(Q_{nr}) \ge d + 1$ (see Lemma 1) and, therefore, contradict
our assumption.

Hence, at least two row blocks contain shattered instances. Let us consider three
of the shattered points ($x_1, x_2, x_3$) which are not all contained in the same row block.
We will show that they cannot be shattered.

Case 1: $x_1, x_2 \in I^i$ and $x_3 \in I^j$.

Consider the concept $c_{123}$ which contains all three instances. $c_{123}$ is in $I_i$ (the only
column block which can contain more than one instance of $I^i$) and $x_3 = I^{j,\mu(j,i)}$.
Now we look for the concept $c_{12}$ which contains $x_1, x_2$ but not $x_3$. As it must
contain two instances of row block $I^i$, $c_{12}$ must be in $I_i$. This would imply that
$x_3$ is contained in $c_{12}$, which is a contradiction.

Case 2: $x_1 \in I^i$, $x_2 \in I^j$ and $x_3 \in I^k$.

Consider again the concept $c_{123}$ which contains all three elements. Now without
loss of generality, we assume $c_{123}$ to be in $I_i$. Then clearly $x_2 = I^{j,\mu(j,i)}$ and
$x_3 = I^{k,\mu(k,i)}$. We furthermore consider $c_{12}$ which contains $x_1, x_2$, but not $x_3$. Due
to the fixed position of $x_2$, this concept can be either in $I_j$ or in $E_2(Q_{nr})$. Both
cases fix $x_1$ on its row position:

If $c_{12}$ is in $I_j$, then clearly $x_1 = I^{i,\mu(i,j)}$.

If $c_{12}$ is in $E_2(Q_{nr})$, then the concept contains exactly two elements at rows $I^{i,\mu(i,j)}$
and $I^{j,\mu(j,i)}$, i.e. again, $x_1 = I^{i,\mu(i,j)}$.

We now look for the concept $c_{13}$ which contains $x_1, x_3$ and not $x_2$. This concept fixes
$x_1$ on row $I^{i,\mu(i,k)}$, which is different from $I^{i,\mu(i,j)}$ as $\mu$ is bijective.

This implies that, in both cases, the three instances are not shattered, in
contradiction to the assumption that $x_1, \dots, x_{d+1}$ are shattered by $E(Q_{nr})$.

Claim 3. For all labeled instances $(x, l) \in X \times \{0, 1\}$, $Q_{nr} \in E(Q_{nr})_{x|l}$.

Let $x = I^{i,\mu(i,j)}$ be an arbitrary instance given out by the learner. If the adversary
chooses a target $c_t$ containing $x$, then $I_j$ is part of the version space and therefore
block $I^j_j = Q_{nr}$ is a submatrix in the version space. The same holds for the case
that $c_t$ does not contain $x$. Here, all $I_k$, $k \ne j$, are part of the version space and
at least one $Q_{nr}$ is a submatrix in the version space. ■

Abstract. The present work introduces and justifies the notion of
hyperrobust learning where one fixed learner has to learn all functions
in a given class plus their images under primitive recursive operators.
The following is shown: This notion of learnability does not change if
the class of primitive recursive operators is replaced by a larger
enumerable class of operators. A class is hyperrobustly Ex-learnable iff it
is a subclass of a recursively enumerable family of total functions. So,
the notion of hyperrobust learning overcomes a problem of the
traditional definitions of robustness which either do not preserve learning by
enumeration or still permit topological coding tricks for the learning
criterion Ex. Hyperrobust BC-learning as well as the hyperrobust version of
Ex-learning by teams are more powerful than hyperrobust Ex-learning.
The notion of bounded totally reliable BC-learning is properly between
hyperrobust Ex-learning and hyperrobust BC-learning. Furthermore, the
bounded totally reliably BC-learnable classes are characterized in terms
of infinite branches of certain enumerable families of bounded recursive
trees. A class of infinite branches of a further family of trees separates
hyperrobust BC-learning from totally reliable BC-learning.



1 Introduction 

Self-reference and coding tricks are an elegant way to prove many separation
results in inductive inference. For example, the class of all functions $f$ such that
$f(0)$ is a program for $f$ separates finite learning from the criterion Num, which
contains all classes that are subsets of enumerable families of total functions.
Similarly, the class of all functions $f$ where $f(0)$ is a program which computes $f$
on almost all (but not necessarily all) places witnesses that BC-learning is more
powerful than Ex-learning [?]. Such coding tricks allow one to build simple proofs
by using the following method: the less pretentious learner can evaluate the
provided self-referential information, which is usually almost the desired output.
But the more pretentious learner has to transform the information into information
of higher quality (for example a program of a partial function into a program


P. Fischer and H.U. Simon (Eds.): EuroCOLT’99, LNAI 1572, pp. 183-^^3 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 



184 Matthias Ott and Frank Stephan 



of a total extension), which turns out to be as difficult as well-known unsolvable
recursion-theoretic problems.

Bārzdiņš proposed several notions of robust learning in order to find a concept
of learning where decoding self-referential information can no longer be
the essential part of learning. In particular, he was interested in the question
whether learning by enumeration is the only type of learning where coding does
not help. His basic hypothesis was that no kind of coding trick is preserved by
all recursive operators. So he strengthened the notion of learning by requiring
that not only the class $S$ itself but also each image $\Theta(S)$ must be learnable for
all suitable operators $\Theta$. Clearly, $\Theta$ has to be recursive, but it was discussed
whether $\Theta$ must map total functions to total ones, that is, must be general
recursive. Jain, Smith and Wiehagen analyzed this question and discovered
that all proposed notions roughly behave like one of the following two cases:

(a) The operator is only required to map the functions in $S$ to total functions.
Then robust learning does not even preserve Num: some operator maps some
class of constant functions onto the class of all recursive functions, which is not
learnable.

(b) The operator is required to be general recursive, that is, the image
of every total function has to be total. Then there is a class outside Num which
is still robustly Ex-learnable.

Case (b) looks much more natural, since any notion of learning in the limit
should cover Num; in particular, every class of constant functions should be
learnable. In his original definition of robust learning, Fulk [?] followed this path
and defined that

a class $S$ is robustly learnable if $\Theta(S)$ is learnable for all general recursive
operators $\Theta$.

Furthermore, he already constructed a class outside Num which is robustly
learnable in the limit. Jain, Smith and Wiehagen [?] showed that Fulk's result can
even be obtained using some topological kind of self-referential coding trick.
They constructed a class of functions $f_1, f_2, \dots$ which converge pointwise to one
function $f$ such that, for every general recursive operator $\Theta$, either almost all
$\Theta(f_k)$ are equal to $\Theta(f)$ and the class to be learned is finite, or the point where
$\Theta(f_k)$ and $\Theta(f)$ become different is, for almost all $k$, an upper bound on a
program for $f_k$. Having such an upper bound, one can find a program for $\Theta(f_k)$ in
the limit.

So, there is some demand to find a notion of robustness which on the one
hand prevents the use of coding tricks and on the other hand preserves at least
Num. The main idea to achieve this goal is to force the learner to cope with
several images $\Theta(S)$ at the same time while keeping these operators restrictive
enough to preserve at least learnability by enumeration. So, given any $S$,

let $[S] = \{\Theta_e(f) : e = 0, 1, \dots \text{ and } f \in S\}$ denote the closure of $S$
under all primitive recursive operators $\Theta_0, \Theta_1, \dots$ and define that $S$ is
hyperrobustly Ex-learnable iff $[S]$ is Ex-learnable.

Theorem [?] justifies this definition since it shows that the hyperrobustly learnable
classes remain the same if one takes any larger enumerable class of operators
instead of the one above. Furthermore, the notion of hyperrobust learning is
compatible with standard robust learnability as used in [?]: if $S$ is hyperrobustly
learnable then $S$ is also robustly learnable. Moreover, if $S$ is closed under finite
variants then both notions are equivalent. The set $[S]$ is dense for every nonempty
class $S$ of functions; thus, hyperrobust learning cannot respect any bounds on
the number of mind changes. Therefore, it is not suitable to look at mind change
complexity in the context of hyperrobust learning, and so this paper focuses on
the notions Num, Ex, BC and teams of Ex-learners or BC-learners.

This new notion of hyperrobust learning also has a further, more intuitive
motivation: Assume that a learner $M$ can learn all axis-parallel rectangles in
the plane. Certainly, one assumes that from $M$ one can build a learner which
additionally infers all rotated rectangles. However, one clearly does not want
to build, for every different rotation $\Theta$, a learner succeeding just on rectangles
mapped by the rotation $\Theta$. Instead, one is interested in a learner which
learns every image of any axis-parallel rectangle under any rotation $\Theta$. The
notion of hyperrobustness reflects this situation by requiring that one learning
machine $M$ learns every image of the functions in a class $S$ under all primitive
recursive operators.



For the reader’s convenience, the definitions of Num, Ex and BC are included
here. A learner M is a total recursive machine which receives as input initial
segments $\sigma$ of a total function $f$ and outputs, for every $\sigma$, a
guess for a program which is intended to represent a rule generating the
function $f$. M learns $f$ iff almost all guesses are programs computing $f$;
M learns a whole class S of functions iff M learns every $f \in S$. The
difference between the three criteria Num, Ex and BC is that a BC-learner need
not satisfy any further requirements. An Ex-learner, however, has to converge
explicitly, that is, for sufficiently large $x$ the programs
$M(f(0)f(1) \ldots f(x))$ have to be the same. Num contains every class which
is a subclass of an enumerable family of total recursive functions. One can
infer the classes in Num by an easy algorithm called “learning by
enumeration”: the Ex-learner always outputs an index for the first function in
the given family which is consistent with the data seen so far. “Num” stands
for classes contained in a numbering. “Ex” stands for explanatory learning,
that is, the learner converges to an explanation or program for $f$. “BC”
stands for behaviourally correct learning, that is, the learner outputs almost
always correct conjectures but does not necessarily converge syntactically to
one single program for $f$.
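
The “learning by enumeration” strategy described above can be illustrated by a
small sketch. It is only a toy under stated assumptions: the enumerable family
is replaced by a short finite Python list (the hypothetical name `FAMILY`),
and hypotheses are list indices rather than program indices.

```python
from typing import Callable, List, Sequence


def learn_by_enumeration(family: Sequence[Callable[[int], int]],
                         data: Sequence[int]) -> int:
    """Return the least index e such that family[e] agrees with data.

    data[x] is the observed value f(x). The family is finite here only
    for illustration, so a LookupError marks the case that no member fits.
    """
    for e, candidate in enumerate(family):
        if all(candidate(x) == value for x, value in enumerate(data)):
            return e
    raise LookupError("no function in the family is consistent with the data")


# Hypothetical toy family of total functions.
FAMILY = [
    lambda x: 0,
    lambda x: x,
    lambda x: x * x,
    lambda x: 2 * x + 1,
]


def guesses(target: Callable[[int], int], upto: int) -> List[int]:
    """Guesses of the learner on the initial segments of length 0..upto-1."""
    return [learn_by_enumeration(FAMILY, [target(x) for x in range(n)])
            for n in range(upto)]
```

On the target $x \mapsto x^2$ the guesses change while shorter segments are
still consistent with earlier family members and then stay at the index of the
target, which is the explicit convergence required of an Ex-learner.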

The notions of robustness can directly be transferred from Ex to BC: S is
hyperrobustly BC-learnable iff [S] is BC-learnable. For ease of notation,
if a result holds for explanatory learning as well as for behaviourally correct
learning, then just the notion “learnable” is used in place of Ex-learnable and
BC-learnable, respectively. If a result holds only for one of these two notions,
then this notion is mentioned explicitly. Of course, when using the simplified
notion “learnable”, one always has to replace every occurrence of “learnable”
consistently by either “Ex-learnable” or “BC-learnable”; one must not mix
both notions.

The interested reader can find background information on recursion theory
in the book of Odifreddi [11] and on inductive inference in the book of
Osherson, Stob and Weinstein [12].

2 General Results 

Hyperrobust learning differs from robust learning in the way that a hyperrobust 
learner succeeds on a class of functions S only if it succeeds on all images of S 
under primitive recursive operators. On the other hand, in the case of robust 
learning, it is only required that for every general recursive operator there is a 
learner succeeding with the data translated by this single operator. 

Before defining robust and hyperrobust learning, the notions of operators are
made more precise. An operator $\Theta$ is general recursive iff there is a
program $e$ such that, for every function $f$ and every $x$, the program
$\varphi_e^f$ computes $\Theta(f)(x)$ by accessing $f$ via built-in oracle
calls and terminates. So, $\Theta$ maps every total function to a total
function. Furthermore, the family $\Theta_0, \Theta_1, \ldots$ is enumerable
iff there is a single general recursive operator $\Theta$ with
$\Theta_e(f)(x) = \Theta(f)(e, x)$. On $\{0,1\}$-valued functions, truth-table
operators and general recursive operators mapping every total function to a
total one coincide. In the general case, they are different as the example of
the operator $\Theta(f)(x) = f(f(x))$ shows. Nevertheless, for the purposes of
the present paper, it suffices to use the more restrictive variant and define
primitive recursive operators as truth-table operators, that is, $\Theta$ is
primitive recursive iff there are two primitive recursive functions $g, h$
such that

$$\Theta(f)(x) = g(x, f(0)f(1) \ldots f(h(x)))$$

for all $f$ and $x$.
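
As a minimal sketch of the truth-table form above, the following code builds
an operator from a pair of functions g and h. The names
`truth_table_operator`, `prefix_sum` and `constant_square` are hypothetical,
and ordinary Python callables stand in for primitive recursive functions, so
nothing here enforces primitive recursiveness.

```python
from typing import Callable, Tuple

Func = Callable[[int], int]


def truth_table_operator(g: Callable[[int, Tuple[int, ...]], int],
                         h: Callable[[int], int]) -> Callable[[Func], Func]:
    """Build Theta with Theta(f)(x) = g(x, f(0) f(1) ... f(h(x)))."""
    def theta(f: Func) -> Func:
        def image(x: int) -> int:
            # f is queried only at the arguments 0, 1, ..., h(x).
            prefix = tuple(f(i) for i in range(h(x) + 1))
            return g(x, prefix)
        return image
    return theta


# Theta(f)(x) = f(0) + f(1) + ... + f(x)
prefix_sum = truth_table_operator(lambda x, prefix: sum(prefix), lambda x: x)

# A constant operator: Theta(f) is the function x -> x*x for every f.
constant_square = truth_table_operator(lambda x, prefix: x * x, lambda x: 0)
```

The constant operator ignores its oracle entirely; operators of this shape map
every function to one fixed function.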

Definition 1. Let S be a class of recursive functions.

(a) S is robustly learnable iff, for every general recursive operator
$\Theta$, there is a learner M which learns
$\Theta(S) = \{\Theta(f) : f \in S\}$.

(b) S is hyperrobustly learnable iff there is one learner M which learns every
function in $[S] = \{\Theta_e(f) : e = 0, 1, \ldots \text{ and } f \in S\}$
where $\Theta_0, \Theta_1, \ldots$ is an enumeration of all primitive
recursive operators.

Note that in this general definition, one can define robust and hyperrobust 
learnability for all common notions of learning, like explanatory learning, be- 
haviourally correct learning and so on. Hyperrobust learning satisfies two easy 
observations. 

Fact 1 (a) [S] contains all primitive recursive functions since for every
primitive recursive function $g$ there is a primitive recursive operator
$\Theta$ with $\Theta(f) = g$ for all functions $f$.

(b) No class S is hyperrobustly learnable with any bound on the number of
mind changes, since [S] is dense and dense classes cannot satisfy mind change
bounds.

These two facts establish a real difference to robust learning because there are 
classes of recursive functions which are even robustly learnable with at most one 
mind change 0. On the other hand, for hyperrobust learning, the notions Ex, 
BC and their team-versions are the most interesting ones. 

The definition of the mapping $S \to [S]$, and thus also the definition of
hyperrobustness, is based on the class of primitive recursive operators. The
decision to choose the class of primitive recursive operators may seem
arbitrary, and one may wonder how other choices for the class of operators
affect the notion of hyperrobustness. The next two results justify the
definition. First it is shown that every hyperrobustly learnable class is
bounded in the following sense.

Definition 2. A class S is bounded iff there is a total recursive function $g$
which dominates every $f \in S$:
$(\forall f \in S)\,(\exists x)\,(\forall y > x)\,[f(y) < g(y)]$.

Second it is shown that if, in the definition of hyperrobust learning, the
enumeration $\Theta_0, \Theta_1, \ldots$ of all primitive recursive operators
is replaced by a larger enumerable class of operators, then one still gets the
same learning notion.

Theorem 2. If S is hyperrobustly learnable then S is bounded. 

Proof. It is sufficient to show the result for the more powerful notion of hy- 
perrobust BC-learning. For function learning, it is convenient to represent the 
BC-learner as an NV"-learner M which predicts the function to be learned al- 
most everywhere but which may be undefined at finitely many places as well as 
on any invalid data m 

Now one defines inductively, for every $f$, the following function
$\Theta(f) = \lim_x \sigma_x$, starting with $\sigma_0 = \lambda$ and using
$M_s(\sigma_x)$ as a notation for the result of $M(\sigma_x)$ after $s$
computational steps (which is either undefined or $M(\sigma_x)$):

$$\sigma_{x+1} = \begin{cases} \sigma_x 1 & \text{if } M_s(\sigma_x)\downarrow = 0 \text{ for some } s < f(0) + f(1) + \ldots + f(x); \\ \sigma_x 0 & \text{otherwise.} \end{cases}$$

Since $\Theta$ is a primitive recursive operator, M has to infer $\Theta(f)$
for every $f \in S$. But whenever $\Theta(f)(x)$ is 1, then M has made a
prediction mistake and so, $\Theta(f)$ takes only finitely often a value
different from 0. Since M has to infer every primitive recursive function by
Fact 1, M learns in particular all functions of the form $\sigma 0^\infty$.
Thus the following function $g$ is recursive:

$$g(x) = \max\{\min\{s : (\exists t < s)\,[M_s(\sigma 0^t)\downarrow = 0]\} : \sigma \in \{0,1\}^x\}.$$

Whenever $f(x) > g(x)$ then one finds within $f(x)$ stages some $t < f(x)$
such that $M(\sigma_x 0^t)\downarrow = 0$. So, the inductive definition of
$\Theta$ diagonalizes for some $y \in \{x, x+1, \ldots, x+f(x)\}$ against the
learner M, that is, $\sigma_{y+1} = \sigma_y 1$ while $M(\sigma_y) = 0$. Thus
there exist at most finitely many $x$ with $f(x) > g(x)$ and, therefore, $g$
dominates $f$. Since the construction of $g$ does not depend on the actual
choice of $f$, $g$ dominates every function in S. ∎

The next result shows that one does not change the notion of hyperrobust learn- 
ing if one uses a more powerful enumerable family of general recursive operators 
instead of the family of all primitive recursive operators. As already mentioned 
above, this result provides an important justification of the model: the defini- 
tion of hyperrobust learning does not depend on the actual choice of the class 
of operators as long as this class is “sufficiently rich” (for example, if the class 
contains all primitive recursive operators, or, all polynomial time computable 
operators). Clearly, if the class of operators contains only the identity operator, 
then hyperrobust and ordinary Ex-learning coincide and so, “sufficiently rich” 
is a necessary and natural postulate. 

Theorem 3. If S is hyperrobustly learnable then S is also hyperrobustly
learnable with respect to any given enumerable family
$\Theta_0, \Theta_1, \ldots$ of general recursive operators.

Proof. The main idea is the following: there is a function $h$ with primitive
recursive graph such that the operator $\Theta$ given by

$$f \mapsto 0^{h(0)} f(0)\, 0^{h(1)} f(1)\, 0^{h(2)} f(2) \ldots$$

is primitive recursive and maps the function $\Theta_e(f)$ into [S] for all
$f \in S$ and every operator $\Theta_e$. Then any hyperrobust learner M also
infers every function $\Theta(\Theta_e(f))$ and can thus be translated into a
learner succeeding on all functions $\Theta_e(f)$ by ignoring the zeros pasted
into $\Theta_e(f)$ by $\Theta$.

So, the main part of the proof is to show that $h$ exists and that the
concatenation $\Theta(\Theta_e)$ is primitive recursive for every operator
$\Theta_e$.

Given S, there exists a recursive function $g$ which dominates all $f \in S$
by Theorem 2. Now, $g$ is used in order to define the desired function $h$:

$h(x)$ is the smallest number of computational steps $s$ such that, for all
$y \le x$, all $e \le x$ and all functions $f$ with $f(z) \le g(z) + x$, the
computation $\Theta_e(f)(y)$ terminates within $s$ steps.

The function $h$ is well-defined since every operator $\Theta_e$ maps every
total (not necessarily recursive) function onto a total function. The
verification that $h$ is recursive uses ideas similar to the proof of the
folklore result that a Turing reduction which gives a total function for every
oracle can be turned effectively into a truth-table reduction. Furthermore,
one can primitive recursively check whether some computation halts within $s$
stages if $s$ is a lower bound for the input. So the graph of $h$ is primitive
recursive. To compute $\Theta(\Theta_e)(f)(x)$ one first checks whether $x$ is
of the form $h(0) + 1 + h(1) + 1 + \ldots + h(y)$. If so, one can compute
$\Theta_e(f)(y)$ within $\max\{h(e), h(y)\}$ steps and output this value; in
particular $f$ is also only queried at places below $\max\{h(e), h(y)\}$.
Otherwise, $\Theta(\Theta_e(f))(x) = 0$ and neither any computations nor any
queries are necessary. So, $\Theta(\Theta_e)$ is a primitive recursive
operator. ∎
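
The zero-padding step of this proof can be sketched in code. This is only an
illustration under stated assumptions: `pad_with_zeros`, `payload_positions`
and `strip_padding` are hypothetical names, h is an arbitrary Python callable
rather than the specific function constructed in the proof, and the sketch
shows only the layout of the padded sequence and how a learner can ignore the
inserted zeros.

```python
from itertools import count, islice
from typing import Callable, Iterator, List, Sequence


def pad_with_zeros(f: Callable[[int], int],
                   h: Callable[[int], int]) -> Iterator[int]:
    """Yield 0^{h(0)} f(0) 0^{h(1)} f(1) 0^{h(2)} f(2) ... as a stream."""
    for y in count():
        for _ in range(h(y)):
            yield 0
        yield f(y)


def payload_positions(h: Callable[[int], int], n: int) -> List[int]:
    """Positions h(0) + 1 + h(1) + 1 + ... + h(y) of f(y), for y < n."""
    positions, pos = [], 0
    for y in range(n):
        pos += h(y)          # skip the block of h(y) zeros
        positions.append(pos)
        pos += 1             # skip the payload value f(y) itself
    return positions


def strip_padding(padded_prefix: Sequence[int],
                  h: Callable[[int], int]) -> List[int]:
    """Recover f(0), f(1), ... from a finite prefix of the padded stream."""
    values, pos, y = [], 0, 0
    while True:
        pos += h(y)
        if pos >= len(padded_prefix):
            return values
        values.append(padded_prefix[pos])
        pos += 1
        y += 1
```

For example, with $h(y) = y + 1$ and $f(y) = y + 5$ the first nine padded
values are 0, 5, 0, 0, 6, 0, 0, 0, 7, and stripping the padding returns
5, 6, 7.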

Corollary 1. (a) If S is hyperrobustly learnable then S is also robustly
learnable in the sense that $\Theta(S)$ is learnable for every general
recursive operator $\Theta$.

(b) If S is closed under finite variants then S is hyperrobustly learnable iff
S is robustly learnable.

Proof. Part (a) and the direction ($\Rightarrow$) of part (b) are a direct
corollary of the preceding theorem. The direction ($\Leftarrow$) of part (b)
is due to coding the index of $\Theta_e$ into the first argument of the
function. Let $e' = \langle e, f(0) \rangle$ and define

$$\Theta(e' f(1) f(2) \ldots) = \Theta_e(f(0) f(1) f(2) \ldots)$$

where $\Theta_0, \Theta_1, \ldots$ is an enumeration of all primitive
recursive operators. Then $\Theta$ is a general recursive operator.
Furthermore, $[S] = \Theta(S)$, since, for every $f \in S$, the function
$\Theta_e(f)$ is the image of the finite variant $e' f(1) f(2) \ldots$ which
is also in S. So, [S] has to be learnable and S is hyperrobustly learnable. ∎

Corollary 1 shows that hyperrobust learning is a natural generalization of
robust learning: it is equivalent to first taking the closure under all finite
variants and then applying a suitable general recursive operator $\Theta$.

The notion of robustness was designed to destroy every numerical coding
trick: for example, if $f(2x)$ is a program for $f$ for almost all $x$, then
the general recursive operator mapping $f$ to $f(1)f(3)f(5) \ldots$ destroys
this coding trick. Jain, Smith and Wiehagen [8] showed that topological coding
(for example, coding the index into the branching points where the functions
to be learned branch away from a fixed recursive function) cannot be destroyed
by using a single general recursive operator. However, topological coding is
destroyed by adding all finite variants of the functions in S to the class to
be learned. Combining these two methods, that is, considering robust learning
of classes closed under finite variants, one might hope that no coding tricks
are left. Indeed, this hope is confirmed by the following characterization
result which shows that the hyperrobustly Ex-learnable classes coincide with
the classes in Num.

The criterion Num is quite prominent — Barzdins, Leeuwen and Zeugman- 
n pni showed that it coincides with further natural criteria: PEx-learning where 
the learner outputs only programs of total functions; NV-learning where the 
learner is total and predicts every $f \in S$ almost everywhere correctly;
robustly totally reliable Ex-learning where a totally reliable Ex-learner
either infers a function or diverges on it. Minicozzi [10] introduced the
notion of reliable learning; the difference between reliable Ex-learning and
totally reliable Ex-learning is that in the second case the learner also has
to diverge on nonrecursive functions while an ordinary reliable Ex-learner may
behave arbitrarily on nonrecursive functions.

The next result adds hyperrobust learning to this list of characterizations
of Num. So, every hyperrobustly Ex-learnable class S can be learned by
enumeration where the learner always outputs an index for the first recursive
function, from a list of total recursive functions, which is consistent with
the data seen so far.

Theorem 4. A class S is hyperrobustly Ex-learnable iff it is in Num.

Proof. One direction is straightforward: If S is in Num, so is [S]. That is,
if $S \subseteq \{f_0, f_1, \ldots\}$ for an enumeration $f_0, f_1, \ldots$ of
total functions then [S] is contained in the enumeration of all
$g_{e,e'} = \Theta_{e'}(f_e)$ where $\Theta_0, \Theta_1, \ldots$ is an
enumeration of all primitive recursive operators.

The converse direction is proven similarly to Theorem 3. Assume that M
Ex-learns [S]. Then M infers all functions of the form $\sigma 0^\infty$.
Since M is an Ex-learner, one knows that, for every $\sigma$, either
$\varphi_{M(\sigma)}(x)$ is defined for the first value
$x \notin \mathrm{dom}(\sigma)$ or $M(\sigma 0^t) \ne M(\sigma)$ for almost
all $t$. Let $g$ be again a strictly increasing recursive function dominating
every $f \in S$. Using $g$ one defines inductively a function $h$ with
primitive recursive graph such that, for every
$\sigma \in \{0, 1, \ldots, g(x)\}^{h(0)+1+h(1)+1+\ldots+h(x-1)}$, either the
computation $\varphi_{M(\sigma)}(h(0)+1+h(1)+1+\ldots+h(x-1))$ has converged
within $h(x)$ steps or $M(\sigma y 0^{h(x)}) \ne M(\sigma)$ for all
$y \le g(x)$. The function $h$ is total since M has to infer every function
which is almost everywhere 0. Now one defines again a primitive recursive
operator $\Theta$ by

$$f \mapsto 0^{h(0)} f(0)\, 0^{h(1)} f(1)\, 0^{h(2)} f(2) \ldots$$

and uses $\Theta$ to get an upper bound on the computation time of every
$f \in S$: The learner M learns $\Theta(f)$ and its guesses
$M(0^{h(0)} f(0)\, 0^{h(1)} f(1)\, 0^{h(2)} f(2) \ldots 0^{h(x)})$ are, for
almost all $x$, equal to some value $e$. From the definition of $h$ it follows
that the computation $\varphi_e(h(0)+1+h(1)+1+h(2)+1+\ldots+h(x))$ converges
within $h(x+1)$ steps to $f(x)$ for almost all $x$. Thus the function
$x \mapsto h(x+1)$ dominates the computation time for all $f \in S$, which
implies that the class S is in Num. ∎

3 Hyperrobust BC-Learning is not Trivial 

Within this section, it is shown that hyperrobust BC-learning does not
collapse to Num as hyperrobust Ex-learning does, and attempts are made to
characterize hyperrobust BC. A major tool in this research is the use of
recursively bounded recursive trees [11, page 509], just called bounded
recursive trees from now on. These trees are a generalization of binary
recursive trees: for a bounded recursive tree T one can compute for every
$\sigma \in T$ a complete list of the immediate successors in T. This is
impossible in the general case, even if $\sigma$ has only finitely many
successors, but it is possible when a recursive function bounds the size of
the successors, that is, whenever $\sigma a \in T$ then $a \le b(|\sigma|)$
for some fixed recursive function $b$. So, one can define a bounded recursive
tree as a recursive function $c$ which associates with every $\sigma \in T$ a
finite and explicit list of all nodes $\sigma a \in T$. If $c$ is primitive
recursive then T is called a bounded primitive recursive tree.

A learning machine M is said to be reliable if M either converges to a correct
program for the input function, or outputs infinitely often a signal for
divergence, which, in the case of Ex-learning, can just be a mind change.
Producing semantic mind changes alone is not sufficient to get a reliable
version of BC-learning that differs from ordinary BC, as the following fact
shows. This fact is based on two observations: First, behaviourally correct
learners can be made consistent.
That is, the new consistent learner outputs for every input a hypothesis which 
is correct on the data seen so far m- Second, consistent learners either con- 
verge semantically to the desired function or make infinitely many semantic mind 
changes. 

Fact 5 For every BC-learnable class S there is a BC-learner which either
converges semantically or makes infinitely many semantic mind changes.

Proof. A given BC-learner M for S can be easily transformed into a new
BC-learner N such that

$$\varphi_{N(\sigma)}(x) = \begin{cases} \sigma(x) & \text{if } x \in \mathrm{dom}(\sigma); \\ \varphi_{M(\sigma)}(x) & \text{otherwise.} \end{cases}$$

If M learns a function $f$, so does N since N changes the guess of M only on
already known arguments by using the given values. If N semantically converges
on $f$ and almost always outputs some program of a fixed function $\psi$ then
$\psi = f$: for every $x$, there is a $\sigma \preceq f$ such that $N(\sigma)$
computes $\psi$ and $x \in \mathrm{dom}(\sigma)$. It follows that
$\psi(x)\downarrow = \sigma(x) = f(x)$. So, N learns a function $f$ iff N
converges semantically on $f$. ∎
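
The transformation from M to N in this proof can be sketched as follows. The
sketch makes a simplifying assumption that is not in the paper: hypotheses are
modelled as Python callables instead of program indices, so "patching" a
hypothesis is just wrapping a function. The names `make_consistent` and
`consistent_learner` are hypothetical.

```python
from typing import Callable, Sequence

Hypothesis = Callable[[int], int]
Learner = Callable[[Sequence[int]], Hypothesis]


def make_consistent(hypothesis: Hypothesis,
                    sigma: Sequence[int]) -> Hypothesis:
    """Agree with sigma on dom(sigma) and with the hypothesis elsewhere."""
    def patched(x: int) -> int:
        if x < len(sigma):
            return sigma[x]      # already known argument: use the datum
        return hypothesis(x)     # unseen argument: defer to M's guess
    return patched


def consistent_learner(m: Learner) -> Learner:
    """Turn a learner M into a learner N whose guesses fit the data seen."""
    def n(sigma: Sequence[int]) -> Hypothesis:
        return make_consistent(m(sigma), tuple(sigma))
    return n
```

A learner that always guesses the constant zero function becomes, after
patching, a learner whose guess on the data 3, 1, 4 returns 3, 1, 4 on the
first three arguments and 0 afterwards.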

Looking a bit closer, it even holds that N learns a function $f$ iff N outputs
infinitely often the same program during the inference of $f$. Therefore, the
analogue of reliable learning for BC must signal divergence more explicitly. A
suitable definition is the following: The reliable BC-learner indicates divergence 
either by outputting a special value like “?” or by making a definitely wrong 
prediction where the underlying BC-learner is given by an NV"-prediction ma- 
chine M [SCSI, which, by definition, is successful on / if 

$$(\forall f \in S)\,(\forall^\infty x)\,[M(f(0)f(1) \ldots f(x))\downarrow = f(x+1)].$$

A more restrictive variant is totally reliable learning pimj where the learner has 
to signal divergence not only on all recursive functions not learned but also on 
all nonrecursive function which cannot be learned by definition. 

Definition 3. M is a reliable BC-learner iff, for every recursive function
$f$, either M BC-learns $f$ by predicting almost always the correct value
(that is, for almost all $x$, $M(f(0)f(1) \ldots f(x))\downarrow = f(x+1)$) or
M diverges on $f$ by infinitely often outputting either ? or a defined but
wrong prediction. M is a totally reliable BC-learner if M diverges also on
every nonrecursive function.

Zeugmann HH observed that robustly totally reliably Ex-learnable classes are 
just those in Num. A related result is that, for bounded classes, Num is equal to 
totally reliable Ex. Together with Theorem 4 one obtains the following
equivalence.

Fact 6 For a bounded class S the following statements are equivalent.

(a) S is in Num.

(b) S is totally reliably Ex-learnable.

(c) S is hyperrobustly Ex-learnable.

The central question of this section is to what extent the equivalence above trans- 
fers to BC. The next characterization of bounded totally reliably BC-learnable 
classes is an important tool to attack this question. 

Theorem 7. A bounded class S is totally reliably BC-learnable iff there is a
family $T_0, T_1, \ldots$ of bounded recursive trees such that every tree has
only finitely many infinite branches and every $f \in S$ is an infinite branch
of such a tree.

Proof. ($\Rightarrow$): Assume that $g$ bounds S and M is an NV''-predictor
for S which in addition signals infinitely often divergence on every function
$f$ which M does not learn. Now let the tree $T_\sigma$ contain all prefixes
of $\sigma$ plus all $\eta \succeq \sigma$ such that

- $\eta(x) \le g(x)$ for all $x \in \mathrm{dom}(\eta) - \mathrm{dom}(\sigma)$ and

- there are no $\tau, a$ with $\sigma \preceq \tau a \preceq \eta$ and $M_{|\eta|}(\tau)\downarrow \ne a$

where, of course, the special symbol “?” is different from $a$.

Clearly, the $T_\sigma$ form a recursive family of trees bounded by $g$.
Assume now that $T_\sigma$ has infinitely many infinite branches. As a
consequence of König’s Lemma and the fact that $T_\sigma$ is finitely
branching, there is an infinite branch $f$ which is not isolated. This implies
that M cannot predict $f$ at almost all points correctly. So, on input $f$,
divergence must also be signaled above $\sigma$ by M, which contradicts the
fact that $f$ is an infinite branch of $T_\sigma$.

For every $f \in S$, there is a prefix $\sigma \preceq f$ such that M predicts
$f$ correctly after seeing $\sigma$ and all $x$ with $f(x) > g(x)$ are in
$\mathrm{dom}(\sigma)$. Then it follows from the definition that $f$ is an
infinite branch of $T_\sigma$ and direction ($\Rightarrow$) is completed.

($\Leftarrow$): Let $T_0, T_1, \ldots$ be a family of bounded recursive trees
such that every tree has only finitely many infinite branches and every
function in S is a branch of such a tree. Without loss of generality, the
family is dense in the sense that for every $\sigma$ there is a tree
containing $\sigma$. This can be achieved by adding all finite trees of the
form $\{\tau : \tau \preceq \sigma\}$ to the list. The new family is still
enumerable and the class of functions on trees in the family remains the same.
Let $T[\tau]$ denote all nodes of the tree T which are comparable to $\tau$.
Now the totally reliable BC-learner works as follows:

$M(\sigma)$ finds the first tree $T_e$ with $\sigma \in T_e$.

If there was a recent change of the tree, that is, if there is $e' < e$ with
$\tau \in T_{e'}$ for all $\tau \prec \sigma$ then $M(\sigma) = {?}$ in order
to signal divergence.

Otherwise $M(\sigma)$ searches for an $a$ such that $T_e[\sigma b]$ is finite
for all $b \ne a$ and $M(\sigma) = a$ if such an $a$ is found.

The first step of the algorithm is well-defined since every $\sigma$ is a node
of some tree $T_e$.

If $f$ is not an infinite branch of any tree then, during the inference of
$f$, M signals infinitely often divergence, since M has infinitely often to
change the tree.

If $f$ is an infinite branch of some tree then there is a first such tree
$T_e$ in the enumeration. For sufficiently large
$\sigma = f(0)f(1) \ldots f(x)$, $f$ is the only infinite branch of
$T_e[\sigma]$ and $\sigma \notin T_{e'}$ for any $e' < e$. Now $f(x+1)$ is the
unique value $a$ with $T_e[\sigma a]$ being infinite. Since the trees $T_e$
are uniformly bounded recursive, a suitable search algorithm finds the value
$f(x+1)$. Therefore, M predicts $f$ almost everywhere, that is,
$M(f(0)f(1) \ldots f(x))\downarrow = f(x+1)$ for almost all $x$.

So, for every function $f$, M either BC-learns $f$ (in the prediction model)
or M signals infinitely often divergence. ∎
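
The three steps of the learner in direction (<=) can be sketched in code under
several assumptions that are not in the paper. A tree is modelled as a Python
function returning the list of immediate successors of a node, the family is a
short finite list (`TREES`, hypothetical), and the unbounded search for
finiteness of the sibling subtrees is replaced by a search up to a fixed
`budget` of levels, so the sketch may answer `None` where the learner of the
proof would keep searching.

```python
from typing import Callable, List, Sequence, Tuple, Union

Node = Tuple[int, ...]
Tree = Callable[[Node], List[int]]  # sigma -> all a with sigma + (a,) in T


def contains(tree: Tree, sigma: Node) -> bool:
    """Check sigma in T by walking down from the root (assumed to be in T)."""
    node: Node = ()
    for a in sigma:
        if a not in tree(node):
            return False
        node = node + (a,)
    return True


def dies_within(tree: Tree, node: Node, depth: int) -> bool:
    """True if some level at most `depth` above node is empty (finite part)."""
    level = [node]
    for _ in range(depth):
        level = [n + (a,) for n in level for a in tree(n)]
        if not level:
            return True
    return False


def predict(trees: Sequence[Tree], sigma: Node,
            budget: int) -> Union[int, str, None]:
    """One prediction step: next value, '?' for divergence, None if unknown."""
    e = next((i for i, t in enumerate(trees) if contains(t, sigma)), None)
    if e is None:
        raise LookupError("family is not dense: no tree contains sigma")
    # Recent change of tree: an earlier tree held all proper prefixes.
    if any(contains(trees[i], sigma[:-1]) for i in range(e)):
        return "?"
    successors = trees[e](sigma)
    for a in successors:
        if all(dies_within(trees[e], sigma + (b,), budget)
               for b in successors if b != a):
            return a
    return None


def zeros_with_dead_ends(node: Node) -> List[int]:
    """Branch 0^infinity plus the dead-end nodes 0^n 1."""
    return [0, 1] if all(a == 0 for a in node) else []


def all_ones(node: Node) -> List[int]:
    """Single branch 1^infinity."""
    return [1] if all(a == 1 for a in node) else []


TREES = [zeros_with_dead_ends, all_ones]
```

On the input 1, 1 the first tree has just been left, so the sketch signals
divergence once; on 1, 1, 1 it predicts the next value 1.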

The next theorem establishes some compatibility between the various notions of 
reliable learning. It shows that reliable Ex-learning and totally reliable BC-learn- 
ing are generalizations of totally reliable Ex-learning in two disjoint directions. 

Theorem 8. A class S is totally reliably Ex-learnable iff S is reliably
Ex-learnable and totally reliably BC-learnable.

Proof. The direction ($\Rightarrow$) is clear. For the converse direction
($\Leftarrow$) note that one can count in the limit the number of signals for
divergence. So, there is a recursive function H such that H converges on $f$
to $c$, if the totally reliable BC-learner, on input $f$, signals exactly $c$
times divergence, and such that H converges on $f$ to $\infty$, if the totally
reliable BC-learner signals infinitely often divergence. Let M be a reliable
Ex-learner and let pad be an injective padding function. Now the new learner N
is given by

$$N(\sigma) = \mathrm{pad}(M(\sigma), H(\sigma)).$$

If $f \in S$ then N converges to $\mathrm{pad}(e, c)$ where $e$ is the program
to which M converges on $f$ and $c$ is the finite number of (false) signals
for divergence produced by the totally reliable BC-learner for S on $f$. If N
converges on $f$, then H converges on $f$ to a finite number and thus $f$ is
recursive. Furthermore, also M must converge to a program $e$ and since M is
reliable and $f$ is recursive, $\mathrm{pad}(e, c)$ is a program for $f$. Thus
N is totally reliable. ∎
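
The combination of the two learners in this proof can be sketched as follows,
with loudly stated simplifications. The Cantor pairing function is used only
as a stand-in for pad: it is injective, but unlike a real padding function it
does not return a program for the same function. The predictor is assumed to
be a total Python callable returning a value, "?" or `None`, and H is
approximated by counting the divergence signals on the prefixes seen so far.
All names are hypothetical.

```python
from typing import Callable, Sequence, Union

Prediction = Union[int, str, None]


def cantor_pair(e: int, c: int) -> int:
    """Injective pairing of two natural numbers (stand-in for pad)."""
    return (e + c) * (e + c + 1) // 2 + c


def count_signals(predict: Callable[[Sequence[int]], Prediction],
                  sigma: Sequence[int]) -> int:
    """Count prefixes where the predictor says '?' or predicts wrongly."""
    signals = 0
    for n in range(1, len(sigma)):
        guess = predict(sigma[:n])   # prediction for the value sigma[n]
        if guess == "?" or (guess is not None and guess != sigma[n]):
            signals += 1
    return signals


def combined_learner(ex_learner: Callable[[Sequence[int]], int],
                     predict: Callable[[Sequence[int]], Prediction]
                     ) -> Callable[[Sequence[int]], int]:
    """N(sigma) = pair(M(sigma), number of divergence signals so far)."""
    def learner(sigma: Sequence[int]) -> int:
        return cantor_pair(ex_learner(sigma), count_signals(predict, sigma))
    return learner
```

Because the pairing is injective, every new divergence signal changes the
output of the combined learner, so it can converge only if the number of
signals stabilizes.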

Case, Kaufmann, Kinber and Kummer [4] showed that there is a family of binary
recursive trees of width 2 whose infinite branches are not Ex-learnable. This
yields a class S which is totally reliably BC-learnable but not reliably
Ex-learnable. For bounded classes, the concept of totally reliable BC-learning
is also a proper generalization of Num. Together with the next result one
obtains that the three notions from Fact 6 all become different for BC: Num is
properly included in bounded totally reliable BC, which is properly included
in hyperrobust BC.

Theorem 9. If S is a bounded and totally reliably BC-learnable class then S is
also hyperrobustly BC-learnable. But the converse implication does not hold.

Proof. For the first statement, let S be bounded and totally reliably
BC-learnable. There is a uniformly recursive family $T_0, T_1, \ldots$ of
trees such that every tree has only finitely many infinite branches and every
function in S is an infinite branch of some tree $T_e$. Without loss of
generality, one can assume that for every $e$ and every primitive recursive
operator $\Theta$ there is some $e'$ such that $T_{e'} = \Theta(T_e)$.
Otherwise, $T_0, T_1, \ldots$ can be replaced by another uniformly recursive
family of trees which contains the trees $T_i$ and which is closed under all
primitive recursive operators. So, whenever $f \in S$ and $\Theta$ is a
primitive recursive operator then there is some tree $T_e$ such that $f$ is an
infinite branch of $T_e$, and there is a further tree $T_{e'} = \Theta(T_e)$.
This implies that $\Theta(f)$ is an infinite branch of $T_{e'}$. So, the class
S' of all infinite branches of $T_0, T_1, \ldots$ contains [S]. By Theorem 7
there is a totally reliable BC-learner M for S'. Now M is also a BC-learner
for [S] and, thus, already a hyperrobust BC-learner for S.

The second statement can be proven using the following idea: one constructs a
family of binary trees $T_0, T_1, \ldots$ such that, for every tree $T_e$ and
every primitive recursive operator $\Theta_i$, the infinite branches of the
image tree $\Theta_i(T_e)$ are either isolated or nonrecursive. Furthermore,
$T_e$ diagonalizes against the learner $M_e$ from an enumerable list
$M_0, M_1, \ldots$ of all learners in such a way that whenever $M_e$ is a
totally reliable BC-learner then $T_e$ has only one infinite branch on which
$M_e$ infinitely often signals divergence. The class

$$S = \{\Theta_i(f) : \Theta_i(f) \text{ is recursive and } f \text{ is on } T_e \text{ for some } i, e\}$$

witnesses the separation. The details of the construction of the $T_e$ and the
verification that S is hyperrobustly BC-learnable but not totally reliably
BC-learnable are omitted due to space constraints. ∎

Zeugmann [H! showed that a totally reliably Ex-learnable class S is also robustly 
learnable under this criterion iff S is in Num. One can deduce from it that S is 
robustly totally reliably Ex-learnable iff S is bounded. The same holds for robust 
totally reliably BC-learnability based on Zeugmann’s diagonalization strategy for 
unbounded classes S and on an adaptation of the implication “bounded totally 
reliable BC => hyperrobust BC” for bounded classes S. 

Theorem 10. A totally reliably BC-learnable class S is also robustly learnable
under this criterion iff S is bounded.



4 Team-Learning and the Union-Theorem 

Barzdins [1,2] showed that there are explanatorily learnable classes
$S_1, S_2$ such that their union is not explanatorily learnable. This result
easily generalizes to the fact that there are unions of $n+1$ learnable
classes which are not contained in the union of $n$ learnable classes m- Pitt
and Smith [14] showed that these

unions can also be characterized in terms of probabilistic learners and teams of 
learners, that is, the following statements are equivalent for Ex-learning as well 
as for BC-learning. 
- S is contained in the union of $n$ learnable classes.

- Some probabilistic machine learns S with some probability $p$ where $p > \frac{1}{n+1}$.

- An $(h, k)$-team with $\frac{h}{k} > \frac{1}{n+1}$ learns S in the sense that there are $k$ learners such that, for every $f \in S$, at least $h$ of them learn $f$.

Note that the probability $p$ and the fraction $\frac{h}{k}$ of successful
machines in the team have to be really greater than $\frac{1}{n+1}$ and cannot
be equal to this value, since a team of $k = h \cdot (n+1)$ learners, where
$h$ learners have to succeed, can already infer the union of $n+1$ learnable
classes: the first $h$ learners follow the algorithm to learn $S_1$, the
second $h$ learners follow the algorithm to learn $S_2$, ..., the last $h$
learners follow the algorithm to learn $S_{n+1}$.
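
The arithmetic behind the strict inequality can be checked with a short
sketch. It only evaluates the threshold condition stated above, using exact
rational arithmetic; the function name `classes_needed` is hypothetical.

```python
from fractions import Fraction


def classes_needed(h: int, k: int) -> int:
    """Smallest n with h/k > 1/(n+1) for an (h, k)-team."""
    if not 0 < h <= k:
        raise ValueError("need 0 < h <= k")
    n = 1
    # Increase n until the success fraction strictly exceeds 1/(n+1).
    while Fraction(h, k) <= Fraction(1, n + 1):
        n += 1
    return n
```

For a team with $k = h \cdot (n+1)$ the fraction equals $\frac{1}{n+1}$
exactly, so the strict condition only holds for $n+1$ classes: with $h = 2$
and $k = 6$ the function returns 3, not 2.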

For hyperrobust Ex-learning, one can show that this connection between
team-learning on the one side and unions on the other side no longer holds.
The hyperrobustly Ex-learnable classes are closed under union but teams of
$n+1$ hyperrobust Ex-learners are more powerful than teams of $n$ learners. An
intuitive explanation for this fact is that if $[S_1 \cup S_2]$ needs a team
of two Ex-learners then so does $[S_1]$ or $[S_2]$. So, the closure operation
does not permit splitting a class of functions into two classes which are
really easier to learn.



Fact 11 If $S_1$ and $S_2$ are hyperrobustly Ex-learnable, so is $S_1 \cup S_2$.

The result follows from the equivalence of hyperrobust Ex-learning and Num and 
from the fact that Num is closed under union. The next result establishes that 
the team hierarchies for hyperrobust Ex-learning and hyperrobust BC-learning 
are proper. 



Theorem 12. The team hierarchy for hyperrobust learning is proper. 



Proof. Let $S_k$ be the set of all functions which are an infinite branch of
some bounded primitive recursive tree of rank up to $k$.

Given $f$, the learning algorithm first finds (in the limit) a tree T such
that $f$ is an infinite branch of T. Having found this tree T, one uses the
algorithm of Case, Kaufmann, Kinber and Kummer [4] who showed that knowing an
index of the tree and having a primitive recursive function majorizing all
infinite branches, one can learn the function by a team of $k+1$ Ex-learners
or $k$ BC-learners, respectively. The team-size is also optimal. The class
$S_k$ is closed, that is, $[S_k] = S_k$. So, it follows that $S_k$ is
learnable by a team of hyperrobust learners of size $k$ (BC) and $k+1$ (Ex),
respectively, but not by a smaller team. ∎

Furthermore, for hyperrobust Ex-learning, one can even show that there exists 
a proper team hierarchy within the class of all hyperrobustly BC-learnable func- 
tions. The n-th level of this hierarchy is given by the class of all infinite branches 
of bounded primitive recursive trees of width up to n, that is, of trees which 
have in every depth at most n nodes. 
5 Conclusion

The research on robust learning has the goal to investigate whether there are 
learning notions which make it impossible to learn a function by evaluating 
self-referential coding-information in the graph of the function. The previous
approaches, which consider all classes $\Theta(S)$, either still allowed some
topological kind of coding [7,8] or permitted partial operators which already
destroy the basic algorithm “learning by enumeration”. The authors believe
that such a basic algorithm should be preserved and therefore propose a new
approach: the learner has to deal with all images $\Theta(S)$ under general
recursive operators simultaneously.
The collection of operators used must nevertheless be restricted since permitting 
all operators would mean to postulate the learning of all recursive functions. 
It is shown that using all primitive recursive operators is a reasonable choice. 
In particular, the following two results justify this notion: first, all sufficiently 
powerful families of operators give the same notion of learning; second, a class S 
is hyperrobustly learnable with respect to this choice of operators iff the closure 
of S under finite variants is robustly learnable with respect to the traditional 
definition. Hyperrobust Ex-learning meets Barzdins’ hypothesis since it collapses 
to Num. But the hyperrobust versions of BC-learning and team-learning permit 
the inference of classes outside Num. There are relations between hyperrobust 
BC-learning and totally reliable BC-learning. Furthermore, families of bounded 
recursive trees turn out to be a useful tool for investigating hyperrobust learning 
and for characterizing totally reliable BC-learning of bounded classes. 

Acknowledgment Frank Stephan is supported by the Deutsche Forschungs- 
gemeinschaft (DFG) grant Am 60/9-2. 



References 

1. Janis Barzdins. Two theorems on the limiting synthesis of functions. In Theory of 
Algorithms and Programs, Latvian State University, Riga, 210:82-88, 1974. 

2. Leonard Blum and Manuel Blum. Towards a mathematical theory of inductive 
inference. Information and Control, 28:125-155, 1975. 

3. John Case, Sanjay Jain, Matthias Ott, Arun Sharma and Frank Stephan. Robust 
learning aided by context. In Proceedings of Eleventh Annual Conference on Computational
Learning Theory (COLT), pages 44-55, ACM Press, New York, 1998.

4. John Case, Susanne Kaufmann, Efim Kinber and Martin Kummer. Learning re- 
cursive functions from approximations. Journal of Computer and System Sciences, 
55:183-196, 1997. 

5. John Case and Carl Smith. Comparison of identification criteria for machine in- 
ductive inference. Theoretical Computer Science, 25:193-220, 1983. 

6. Jerome Feldmann. Some decidability results on grammatical inference and com- 
plexity. Information and Control, 20:244-262, 1972. 

7. Marc Fulk. Robust separations in inductive inference. In Proceedings of the 31st 
Annual Symposium on Foundations of Computer Science (FOCS), pages 405-410, 
St. Louis, Missouri, 1990. 






8. Sanjay Jain, Carl Smith and Rolf Wiehagen. On the power of learning robustly. 
In Proceedings of Eleventh Annual Conference on Computational Learning Theory 
(COLT), pages 187-197, ACM Press, New York, 1998. 

9. Wolfgang Merkle and Frank Stephan. Trees and learning. Proceedings of the Ninth 
Annual Conference on Computational Learning Theory (COLT), pages 270-279, 
ACM Press, New York, 1996. 

10. Eliana Minicozzi. Some natural properties of strong-identification in inductive 
inference. Theoretical Computer Science, 2:345-360, 1976. 

11. Piergiorgio Odifreddi. Classical Recursion Theory. North-Holland, Amsterdam, 
1989. 

12. Daniel Osherson, Michael Stob and Scott Weinstein. Systems that Learn. MIT 
Press, Cambridge, Massachusetts, 1986. 

13. Lenny Pitt. Probabilistic inductive inference. Journal of the Association of Computing
Machinery, 36:383-433, 1989.

14. Lenny Pitt and Carl Smith. Probability and plurality for aggregations of learning 
machines. Information and Computation, 77:77-92, 1988.

15. Karlis Podnieks. Comparing various concepts of function prediction. Part 1. Theory 
of Algorithms and Programs, Latvian State University, Riga, 210:68-81, 1974. 

16. Carl Smith. The power of pluralism for automatic program synthesis. Journal of
the Association of Computing Machinery, 29:1144-1165, 1982. 

17. Thomas Zeugmann. On Barzdins’ conjecture. In K. P. Jantke, editor. Proceedings 
of the International Workshop on Analogical and Inductive Inference (AII’86), vol- 
ume 265 of LNCS, pages 220-227. Springer, 1986. 




Mind Change Complexity of Learning Logic Programs *



Sanjay Jain^1 and Arun Sharma^2

^1 School of Computing
National University of Singapore
Singapore 119260, Republic of Singapore
Email: sanjay@comp.nus.edu.sg
^2 School of Computer Science and Engineering
The University of New South Wales
Sydney, NSW 2052, Australia
Email: arun@cse.unsw.edu.au



Abstract. The present paper motivates the study of mind change com- 
plexity for learning minimal models of length-bounded logic programs. 

It establishes ordinal mind change complexity bounds for learnability of 
these classes both from positive facts and from positive and negative 
facts. 

Building on Angluin’s notion of finite thickness and Wright’s work on fi- 
nite elasticity, Shinohara defined the property of bounded finite thickness 
to give a sufficient condition for learnability of indexed families of com- 
putable languages from positive data. This paper shows that an effective 
version of Shinohara’s notion of bounded finite thickness gives sufficient 
conditions for learnability with ordinal mind change bound, both in the 
context of learnability from positive data and for learnability from com- 
plete (both positive and negative) data. 

More precisely, it is shown that if a language defining framework yields a 
uniformly decidable family of languages and has effective bounded finite 
thickness, then for each natural number $m > 0$, the class of languages
defined by formal systems of length $\le m$:

— is identifiable in the limit from positive data with a mind change 
bound of $\omega^m$;

— is identifiable in the limit from both positive and negative data with 
an ordinal mind change bound of $\omega \times m$.

The above sufficient conditions are employed to give an ordinal mind 
change bound for learnability of minimal models of various classes of 
length-bounded Prolog programs, including Shapiro’s linear programs, 
Arimura and Shinohara’s depth-bounded linearly-covering programs, and 
Krishna Rao’s depth-bounded linearly-moded programs. It is also noted 
that the bound for learning from positive data is tight for the example 
classes considered.



P. Fischer and H.U. Simon (Eds.): EuroCOLT’99, LNAI 1572, pp. 198-|^^ 1999. 
(c) Springer-Verlag Berlin Heidelberg 1999






1 Motivation and Introduction 

Machine learning in the context of first-order logic and its subclasses can be 
traced back to the work of Plotkin [Plo71] and Shapiro [Sha81]. In recent years,
this work has evolved into the very active field of Inductive Logic Programming 
(ILP). Numerous practical systems have been built to demonstrate the feasibil- 
ity of learning logic programs as descriptions of complex concepts. The utility 
of these systems has been demonstrated in many domains including drug de- 
sign, protein secondary structure prediction, and finite element mesh design (see 
Muggleton and De Raedt [MDR94], Lavrac and Dzeroski, Bergadano and
Gunetti, and Nienhuys-Cheng and de Wolf [NCdW97] for surveys of
this field).

Together with practical developments, there has also been some interest in deriving
learnability results for ILP. Several results in the PAC setting have been
established; we refer the reader to Dzeroski, Muggleton, and Russell [DMR92a,
DMR92b], Cohen [Coh95a, Coh95b, Coh95c], De Raedt and Dzeroski,
Haussler, Frisch and Page, Yamamoto [Yam93], Kietz [Kie93],
and Maass and Turan.

Insights about which classes of logic programs are suitable as hypothesis spaces 
from the learnability perspective are likely to be very useful to ILP. Unfortunately,
the few positive results that have been demonstrated in the PAC setting are
for very restricted classes of logic programs. Hence, it is useful to consider more
general models to analyze learnability of logic programs.^1 In the present paper,
we develop tools to investigate identifiability in the limit with “mind change 
bounds” of minimal models of logic programs. 

The first identification in the limit result about learnability of logic programs
is due to Shapiro [Sha81]. He showed that the class of h-easy models is identifiable
in the limit from both positive and negative facts. Adapting the work
on learnability of subclasses of elementary formal systems,^2 Shinohara [Shi91]
showed that the class of minimal models of linear Prolog programs consisting of
at most m clauses is identifiable in the limit from only positive facts. Unfortunately,
linear logic programs are very restricted as they do not even allow local
variables (i.e., each variable in the body must appear in the head). Arimura and
Shinohara introduced a class of linearly-covering logic programs that allows
local variables in a restricted sense. They showed that the class of minimal
models of linearly-covering Prolog programs consisting of at most m clauses of
bounded body length is identifiable in the limit from only positive facts. Krishna
Rao noted that the class of linearly-covering programs is very restrictive
as it does not even include the standard programs for reverse, merge, split,
partition, quick-sort, and merge-sort. He proposed the class of linearly-moded
programs, which includes all these standard programs, and showed the class
of minimal models of such programs consisting of at most m clauses of bounded
body length to be identifiable in the limit from positive facts.

^1 The learnability analysis of ILP in the learning by query model is able to overcome
some of the restrictive nature of the PAC model by allowing the learner queries to
an oracle. For examples of such analysis, see Khardon [Kha98] and Krishna Rao and
Sattar [KRS98].

^2 Arikawa adapted Smullyan’s [Smu61] elementary formal systems (EFS) for
investigation of formal languages. Later, Arikawa et al. showed that EFS can
be viewed as a logic programming language over strings. Recently, various subclasses
of EFS have been investigated in the context of learnability (e.g., see Shinohara
[Shi91, Shi94]).

While the above results are positive, it may be argued that the model is too gen- 
eral as the number of mind changes allowed is unbounded. Some authors have 
considered a polynomial time bound on the update of hypotheses in the identifi- 
cation in the limit setting. However, this restriction may not be very meaningful 
if the number of mind changes (and consequently the number of updates) is 
unbounded. The present paper considers learnability of logic programs in the 
identification in the limit setting with a bound on the mind change complexity.
Recently, a number of approaches for modeling mind change bounds have
been proposed [FS93, JS97, AJS97, Amb95, SSV97, AFS96]. In the present paper,
we employ constructive ordinals as bounds for mind changes. We illustrate this 
notion in the context of identification in the limit of languages from positive 
data. 

TxtEx denotes the collection of language classes that can be identified in the 
limit from positive data. An obvious approach to bounding the number of mind 
changes is to require that the learning machine make no more than a constant 
number of mind changes. This approach of employing natural numbers as mind 
change bounds was first considered by Barzdins and Podnieks [BP73] (see also
Case and Smith [CS83]). For each natural number $m$, $\mathbf{TxtEx}_m$ denotes the set
of language classes that can be identified in the limit from positive data with no 
more than m mind changes. However, a constant mind change bound has several 
drawbacks: 

— it places the same bound on each language in the class irrespective of its 
“complexity” ; 

— it does not take into account scenarios in which a learner, after examining an 
element of the language, is in a position to issue a bound on the number of 
mind changes (i.e., the learner computes and updates mind change bounds 
based on the incoming data). 

To model situations where a mind change bound can be derived from data and 
updated as more data becomes available, constructive ordinals have been employed
as mind change counters by Freivalds and Smith [FS93], and by Jain and
Sharma [JS97]. We describe this notion next.

$\mathbf{TxtEx}_\alpha$ denotes the set of language classes that can be identified in the limit
from positive data with an ordinal mind change bound $\alpha$. We illustrate the
interpretation of this notion with a few examples. Let $\omega$ denote a notation for
the least limit ordinal. Then a mind change bound of $\alpha \prec \omega$ is the earlier notion
of mind change identification where the bound is a natural number. For $\alpha = \omega$,
$\mathbf{TxtEx}_\omega$ denotes learnable classes for which there exists a machine that, after
examining some element(s) of the language, can announce an upper bound on the
number of mind changes it will make before the onset of successful convergence.
Angluin’s [Ang80b, Ang80a] class of pattern languages is a member of $\mathbf{TxtEx}_\omega$.
Proceeding on, the class $\mathbf{TxtEx}_{\omega \times 2}$ contains classes for which there is a learning
machine that, after examining some element(s) of the language, announces an
upper bound on the number of mind changes, but reserves the right to revise its
upper bound once. $\mathbf{TxtEx}_{\omega^2}$ contains classes for which the machine announces
an upper bound on the number of times it may revise its conjectured upper
bound on the number of mind changes, and so on.
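The reading of ordinal bounds given above can be made concrete with a small sketch. It rests on an encoding that is an assumption of this illustration and not part of the paper: an ordinal below $\omega^m$ is written as a tuple of $m$ natural-number coefficients, compared lexicographically. The function names are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed encoding (illustration only): the tuple (c_{m-1}, ..., c_1, c_0)
# stands for omega^(m-1)*c_{m-1} + ... + omega*c_1 + c_0.
Ordinal = Tuple[int, ...]


def precedes(a: Ordinal, b: Ordinal) -> bool:
    """Return True iff ordinal a is strictly below ordinal b."""
    if len(a) != len(b):
        raise ValueError("ordinals must use the same number of coefficients")
    return a < b  # tuple comparison is lexicographic


def is_non_increasing(run: Sequence[Ordinal]) -> bool:
    """Check that a run of counter values never goes up."""
    return all(not precedes(prev, cur) for prev, cur in zip(run, run[1:]))


def count_strict_drops(run: Sequence[Ordinal]) -> int:
    """Count the steps at which the counter strictly decreases."""
    return sum(1 for prev, cur in zip(run, run[1:]) if precedes(cur, prev))


# A counter that starts at omega, then announces "at most 3 mind changes".
run = [(1, 0), (1, 0), (0, 3), (0, 2), (0, 2), (0, 0)]
```

In this run the step from (1, 0) to (0, 3) corresponds to the machine announcing a finite bound after seeing some data; every later mind change must be paid for by a further strict drop.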

The notion of ordinal mind change bound has been employed to give learnabil- 
ity results for unions of pattern languages and subclasses of elementary formal 
systems (see [JS97, AJS97]). In the present paper, we generalize these results
to establish two sufficient conditions for learnability with ordinal mind change 
bounds and apply the results to obtain mind change bounds for learning subclass- 
es of length-bounded logic programs. We discuss these two sufficient conditions 
briefly. 

Let $U$ be an enumerable set of objects. A language is any subset of $U$; a typical
variable for languages is $L$. Let $R$ be an enumerable set of rules.^3 A finite subset
of $R$ is referred to as a formal system; a typical variable for formal systems is $\Gamma$.
Let Lang be a mapping from the set of formal systems to languages.^4 We call
the triple $\langle U, R, \mathrm{Lang} \rangle$ a language defining framework. In the sequel, we only
consider those language defining frameworks for which the class
$\{\mathrm{Lang}(\Gamma) \mid \Gamma \text{ is a finite subset of } R\}$ is a uniformly decidable family of computable languages.
Furthermore, we suppose that a decision procedure for $\mathrm{Lang}(\Gamma)$ can be found
effectively from $\Gamma$.



A semantic mapping from formal systems to languages is monotonic just in case
$\Gamma \subseteq \Gamma'$ implies $\mathrm{Lang}(\Gamma) \subseteq \mathrm{Lang}(\Gamma')$. A formal system $\Gamma$ is said to be reduced
with respect to a finite $X \subseteq U$ just in case $X$ is contained in $\mathrm{Lang}(\Gamma)$ but
not in any language defined by a proper subset of $\Gamma$. We assume, without loss
of generality for this paper, that for all finite sets $X \subseteq U$, there exists a finite
$\Gamma \subseteq R$ such that $X \subseteq \mathrm{Lang}(\Gamma)$. Building on the work of Angluin [Ang80b]
on finite thickness and of Wright on finite elasticity, Shinohara [Shi91]
defined a language defining framework to have bounded finite thickness just in
case

(a) it is monotonic and

(b) for each finite $X \subseteq U$ and for each natural number $m > 0$, the set of
languages defined by formal systems that

(i) are reduced with respect to $X$ and

(ii) are of cardinality $\le m$,

is finite. He showed that if a language defining framework has bounded finite
thickness, then for each $m > 0$, the class of languages definable by formal systems
of cardinality $\le m$ is identifiable in the limit from positive data.

^3 These could be productions in the context of formal languages or clauses in the
context of logic programs.

^4 For technical convenience, we assume that $\mathrm{Lang}(\emptyset) = \emptyset$.

The present paper places a further requirement on Shinohara’s notion of bounded 
finite thickness to derive sufficient conditions for learnability with mind change
bounds. A language defining framework is said to have effective bounded finite
thickness just in case the set of formal systems that are reduced with respect to
$X$ in the definition of bounded finite thickness can be obtained effectively in $X$.
We show that the notion of effective bounded finite thickness gives an ordinal 
mind change bound for both learnability from positive data and for learnability 
from positive and negative data. In particular, we establish that if a language 
defining framework has effective bounded finite thickness, then for each natural 
number $m > 0$, the class of languages defined by formal systems of cardinality
$\le m$:

— is identifiable in the limit from positive data with a mind change bound of 
$\omega^m$;

— is identifiable in the limit from both positive and negative data with an 
ordinal mind change bound of $\omega \times m$.

We employ the above results to give mind change bounds for the following classes 
of Prolog programs: 

(a) Shapiro’s linear logic programs (a similar result can be shown for the class
of hereditary logic programs [MSS91, MSS93] and reductive logic programs
[KR96]);

(b) Krishna Rao’s linearly-moded logic programs with bounded body length
(a similar result holds for Arimura and Shinohara’s linearly-covering logic programs
with bounded body length).

In the sequel we proceed as follows. Section 2 introduces the preliminaries of
ordinal mind change identification. Section 3 establishes sufficient conditions for
learnability with ordinal mind change bound for both positive data and positive
and negative data. In Section 4 we introduce preliminaries of logic programming
and apply the results from Section 3 to establish mind change bounds for
learnability of minimal models of various subclasses of length-bounded Prolog
programs.



2 Ordinal Mind Change Identification 

$N$ denotes the set of natural numbers, $\{0, 1, 2, \ldots\}$. Any unexplained recursion
theoretic notation is from [Rog67]. The cardinality of a set $S$ is denoted $\mathrm{card}(S)$.
The maximum and minimum of a set are represented by $\max(\cdot)$ and $\min(\cdot)$,
respectively. The symbols $\subseteq$, $\supseteq$, $\subset$, $\supset$, and $\emptyset$ respectively stand for subset, superset,
proper subset, proper superset, and the empty set. $\Lambda$ denotes the empty
sequence.

Definition 1. A class of languages $\mathcal{L} = \{L_i \mid i \in N\}$ is a uniformly decidable
family of computable languages just in case there exists a computable function
$f$ such that for each $i \in N$ and for each $x \in U$,

\[ f(i, x) = \begin{cases} 1 & \text{if } x \in L_i, \\ 0 & \text{otherwise.} \end{cases} \]



As noted in the introduction, we only consider uniformly decidable families of 
computable languages. In the next three definitions we introduce texts (positive 
data presentation), informants (positive and negative data presentation), and 
learning machines. 

Definition 2. [Gol67]

(a) A text $T$ is a mapping from $N$ into $U \cup \{\#\}$. (The symbol # models pauses
in data presentation.)

(b) content($T$) denotes the intersection of $U$ and the range of $T$.

(c) A text $T$ is for a language $L$ iff $L = \mathrm{content}(T)$.

(d) The initial sequence of text $T$ of length $n$ is denoted $T[n]$.

(e) The set of all finite initial sequences of $U$ and #’s is denoted SEQ. We let $\sigma$
and $\tau$ range over SEQ.



Definition 3. [Gol67]

(a) An informant $I$ is an infinite sequence over $U \times \{0, 1\}$ such that for each
$x \in U$ either $(x, 1)$ or $(x, 0)$ (but not both) appears in the sequence.

(b) An informant $I$ is for $L$ iff $(x, 1)$ appears in $I$ if $x \in L$ and $(x, 0)$ appears in
$I$ if $x \notin L$.

(c) $I[n]$ denotes the initial sequence of informant $I$ with length $n$.

(d) $\mathrm{content}(I) = \{(x, y) \mid (x, y) \text{ appears in sequence } I\}$; $\mathrm{content}(I[n])$ is defined
similarly.

(e) $\mathrm{PosInfo}(I[n]) = \{x \mid (x, 1) \in \mathrm{content}(I[n])\}$; $\mathrm{NegInfo}(I[n]) = \{x \mid (x, 0) \in \mathrm{content}(I[n])\}$.

(f) $\mathrm{InfSEQ} = \{I[n] \mid I \text{ is an informant for some } L \subseteq U\}$. We let $\sigma$ and $\tau$ also
range over InfSEQ.



Definition 4. Let T denote the set of all formal systems. 

(a) A learning machine from texts (informants) is an algorithmic mapping from 
SEQ (InfSEQ) into T U {?}. A typical variable for learning machines is M. 

(b) M is said to converge on text $T$ to $\Gamma$ (written: $M(T)$ converges to $\Gamma$, or
$M(T)\downarrow = \Gamma$) just in case for all but finitely many $n$, $M(T[n]) = \Gamma$. A
similar definition for informants holds.

A conjecture of “?” by a machine is interpreted as “no guess at this moment.”
This is useful to avoid biasing the number of mind changes of a machine. For this
paper, we assume, without loss of generality, that $\sigma \subseteq \tau$ and $M(\sigma) \ne\ ?$ imply
$M(\tau) \ne\ ?$.
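As a small illustration of why the “?” conjecture does not bias the count, the sketch below counts mind changes in a finite run of conjectures: a step counts only when the earlier conjecture is a real guess and the next one differs from it. The function name and the string encoding of conjectures are assumptions of this sketch, not notation from the paper.

```python
from typing import Hashable, Sequence

NO_GUESS = "?"  # the "no guess at this moment" conjecture


def count_mind_changes(conjectures: Sequence[Hashable]) -> int:
    """Count steps n with conjectures[n] != "?" and conjectures[n] != conjectures[n+1]."""
    return sum(
        1
        for current, following in zip(conjectures, conjectures[1:])
        if current != NO_GUESS and current != following
    )


# Moving from "?" to a first real guess is not a mind change.
example = ["?", "?", "G1", "G1", "G2", "G2", "G3"]
```

For the run above, only the steps G1 to G2 and G2 to G3 are counted.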

We next introduce ordinals as models for mind change counters. We assume a
fixed notation system, $O$, and partial ordering of ordinal notations as used by,
for example, Kleene [Kle38, Rog67, Sac90]. $\prec$, $\preceq$, $\succ$, and $\succeq$ on ordinal notations
below refer to the partial ordering of ordinal notations in this system. We do not
go into the details of the notation system used, but instead refer the reader to
[Kle38, Rog67, Sac90]. In the sequel, we are somewhat informal and
use $+$, $\times$, and $\omega^m$, for all $m \in N$, as notation for the corresponding ordinal notations.

Definition 5. $F$, an algorithmic mapping from SEQ (or InfSEQ) into ordinal
notations, is an ordinal mind change counter function just in case
$(\forall \sigma \subseteq \tau)[F(\sigma) \succeq F(\tau)]$.



Definition 6. [FS93, JS97] Let $\alpha$ be an ordinal notation.

(a) We say that M, with associated ordinal mind change counter function F,
$\mathbf{TxtEx}_\alpha$-identifies a text $T$ just in case the following three conditions hold:

(i) $M(T)\downarrow = \Gamma$ and $\mathrm{Lang}(\Gamma) = \mathrm{content}(T)$,

(ii) $F(\Lambda) = \alpha$, and

(iii) $(\forall n)[\,? \ne M(T[n]) \ne M(T[n+1]) \Rightarrow F(T[n]) \succ F(T[n+1])]$.

(b) M, with associated ordinal mind change counter function F, $\mathbf{TxtEx}_\alpha$-identifies
$L$ (written: $L \in \mathbf{TxtEx}_\alpha(M, F)$) just in case M, with associated
ordinal mind change counter function F, $\mathbf{TxtEx}_\alpha$-identifies each text
for $L$.

(c) $\mathbf{TxtEx}_\alpha = \{\mathcal{L} \mid (\exists M, F)[\mathcal{L} \subseteq \mathbf{TxtEx}_\alpha(M, F)]\}$.



Definition 7. [FS93, JS97] Let $\alpha$ be an ordinal notation.

(a) We say that M, with associated ordinal mind change counter function F,
$\mathbf{InfEx}_\alpha$-identifies an informant $I$ just in case the following three conditions
hold:

(i) $M(I)\downarrow = \Gamma$ and $\mathrm{Lang}(\Gamma) = \mathrm{PosInfo}(I)$,

(ii) $F(\Lambda) = \alpha$, and

(iii) $(\forall n)[\,? \ne M(I[n]) \ne M(I[n+1]) \Rightarrow F(I[n]) \succ F(I[n+1])]$.

(b) M, with associated ordinal mind change counter function F, $\mathbf{InfEx}_\alpha$-identifies
$L$ (written: $L \in \mathbf{InfEx}_\alpha(M, F)$) just in case M, with associated
ordinal mind change counter function F, $\mathbf{InfEx}_\alpha$-identifies each informant
for $L$.

(c) $\mathbf{InfEx}_\alpha = \{\mathcal{L} \mid (\exists M, F)[\mathcal{L} \subseteq \mathbf{InfEx}_\alpha(M, F)]\}$.



We refer the reader to Ambainis [Amb95] for a discussion on how the learnability
classes depend on the choice of the ordinal notation.

3 Conditions for learnability with mind change bound

We now formally define what it means for a language defining framework to
have the property of effective bounded finite thickness. Recall that a semantic
mapping Lang is monotonic just in case for any two formal systems $\Gamma$ and
$\Gamma'$, $\Gamma \subseteq \Gamma' \Rightarrow \mathrm{Lang}(\Gamma) \subseteq \mathrm{Lang}(\Gamma')$. Also, recall from the introduction that
we only consider language defining frameworks that yield uniformly decidable 
families of computable languages. 

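Definition 8. Let $\langle U, R, \mathrm{Lang} \rangle$ be a language defining framework such that
Lang is monotonic. For any finite $X \subseteq U$, let

\[ \mathrm{Gen}_X = \{\Gamma \mid \Gamma \subseteq R \wedge \mathrm{card}(\Gamma) < \infty \wedge X \subseteq \mathrm{Lang}(\Gamma)\} \]

and let

\[ \mathrm{Min}_X = \{\Gamma \in \mathrm{Gen}_X \mid (\forall \Gamma' \in \mathrm{Gen}_X)[\Gamma' \not\subset \Gamma]\}. \]

Then $\langle U, R, \mathrm{Lang} \rangle$ is said to have effective bounded finite thickness just in case
for all finite $X \subseteq U$, $\mathrm{Min}_X$ is finite and can be obtained effectively in $X$ (i.e., there
are recursive functions (in $X$) for enumerating $\mathrm{Min}_X$ and for finding the cardinality
of $\mathrm{Min}_X$).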
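To make the sets $\mathrm{Gen}_X$ and $\mathrm{Min}_X$ tangible, here is a minimal brute-force sketch for a toy framework. Everything concrete in it is an assumption of the illustration: the rule set is finite (the paper allows any enumerable $R$), each rule contributes a fixed finite set of objects, and Lang of a system is the union of its rules' sets, which is monotonic.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def lang(system: Iterable[str], rule_langs: Dict[str, Set[int]]) -> Set[int]:
    """Toy semantics: the language of a system is the union of its rules' sets."""
    covered: Set[int] = set()
    for rule in system:
        covered |= rule_langs[rule]
    return covered


def min_systems(x: Set[int], rule_langs: Dict[str, Set[int]]) -> List[FrozenSet[str]]:
    """Enumerate Min_X: systems covering x with no proper subset that covers x."""
    rules = sorted(rule_langs)
    # Gen_X: every finite system whose language contains x.
    gen = [
        frozenset(candidate)
        for size in range(len(rules) + 1)
        for candidate in combinations(rules, size)
        if x <= lang(candidate, rule_langs)
    ]
    # Keep only the members of Gen_X that are minimal under inclusion.
    return [g for g in gen if not any(other < g for other in gen)]


toy_rules = {"r1": {1, 2}, "r2": {2, 3}, "r3": {1, 2, 3}}
```

With these hypothetical rules and $X = \{1, 3\}$, the systems {r3} and {r1, r2} are the only reduced ones; a system such as {r1, r3} covers $X$ but is not minimal.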



3.1 Learnability from positive data 

We now show that if a language defining framework has effective bounded finite 
thickness then the class of languages defined by formal systems of cardinality 
$\le m$ can be $\mathbf{TxtEx}_{\omega^m}$-identified. This result is a generalization of a lemma
from [AJS97]. To state this lemma, we need some technical machinery which we
describe next. 

Definition 9. A search tree is a finite labeled rooted tree. We denote the label
of node $v$ in search tree $H$ by $C_H(v)$.

Intuitively, the labels on the nodes are interpreted as decision procedures. We
abuse the notation slightly and by $\mathrm{Lang}(C_H(v))$ we mean the language decided
by $C_H(v)$. We next introduce a partial order on search trees.

Definition 10. Suppose $H_1$ and $H_2$ are two search trees. We say that $H_1 \preceq H_2$
just in case the following properties are satisfied:

(A) the root of $H_1$ has the same label as the root of $H_2$;

(B) $H_1$ is a labeled subgraph of $H_2$; and

(C) all nodes of $H_1$, except the leaves, have exactly the same children in both
$H_1$ and $H_2$.

Essentially, $H_1 \preceq H_2$ means that $H_2$ is obtained by attaching some (possibly
empty) trees to some of the leaves of the search tree $H_1$. It is helpful to formalize
the notion of depth of a search tree as follows: depth of root is 0; depth of a child
is 1 + depth of parent; and depth of a search tree is depth of its deepest leaf.
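The depth convention and the order on search trees can be sketched as follows. The `Node` class, the use of string labels, and the assumption that children are kept in a fixed order are all choices made for this illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def depth(tree: Node) -> int:
    """Depth of the deepest leaf; a lone root has depth 0."""
    if not tree.children:
        return 0
    return 1 + max(depth(child) for child in tree.children)


def extends(h1: Node, h2: Node) -> bool:
    """Return True iff h2 arises from h1 by attaching subtrees below leaves of h1."""
    if h1.label != h2.label:
        return False
    if not h1.children:
        return True  # a leaf of h1 may have gained any subtree in h2
    if len(h1.children) != len(h2.children):
        return False  # inner nodes must keep exactly the same children
    return all(extends(a, b) for a, b in zip(h1.children, h2.children))


small = Node("a", [Node("b"), Node("c")])
large = Node("a", [Node("b", [Node("d")]), Node("c")])
```

Here `large` extends `small` because the only difference is a subtree attached below the leaf labeled "b".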

$Q$, a mapping from SEQ to search trees, is called an m-Explorer iff the following
properties are satisfied:

(A) $\sigma \subseteq \tau \Rightarrow Q(\sigma) \preceq Q(\tau)$;

(B) $(\forall \sigma)[\mathrm{depth}(Q(\sigma)) \le m]$; and

(C) for all $T$, $Q(T)\downarrow$, i.e., $(\forall^\infty n)[Q(T[n]) = Q(T[n+1])]$.

(The reader should note that C is actually implied by A and B; C has been
included to emphasize the point.)

We can now state the lemma from [AJS97] that links the existence of an m-Explorer
to $\mathbf{TxtEx}_{\omega^m}$-identification.

Lemma 1. Suppose $Q$ is an m-Explorer. Then there exists a machine M and
an associated ordinal mind change counter F such that the following properties
are satisfied:

(A) $(\forall \text{ texts } T)[M(T)\downarrow]$;

(B) $F(\Lambda) = \omega^m$; and

(C) if there exists a node $v$ in $Q(T)$ such that $C_{Q(T)}(v)$ is a decision procedure
for content($T$), then M, with associated mind change counter F, $\mathbf{TxtEx}_{\omega^m}$-identifies
$T$.

We now establish a theorem that bridges Lemma 1 with the notion of effective
bounded finite thickness and $\mathbf{TxtEx}_{\omega^m}$-identifiability.

Theorem 1. Let $\langle U, R, \mathrm{Lang} \rangle$ be a language defining framework with effective
bounded finite thickness. For each $m > 0$, let

\[ \mathcal{L}^m = \{\mathrm{Lang}(\Gamma) \mid \Gamma \subseteq R \wedge \mathrm{card}(\Gamma) \le m\}. \]

Then for each $m > 0$, there exists an m-Explorer $Q$ such that for any text $T$ for
any $L \in \mathcal{L}^m$ there is a node $v$ in $Q(T)$ which is labelled by the decision procedure
for content($T$) $= L$.

Proof. We construct an m-Explorer $Q$ as follows. Let $T$ be a text. Let $Q(\Lambda)$ be
just a root with label $\emptyset$. $Q(T[n+1])$ is obtained as follows. For each leaf $v$ in
$Q(T[n])$ such that $\mathrm{depth}(v) < m$ and $\mathrm{content}(T[n+1]) \not\subseteq \mathrm{Lang}(C_{Q(T[n])}(v))$ do the
following:

For each $\Gamma \in \mathrm{Min}_{\mathrm{content}(T[n+1])}$ add a child to $v$ with label of the decision
procedure for $\mathrm{Lang}(\Gamma)$.

It is easy to verify that $Q$ is an m-Explorer. □
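The construction in the proof can be traced on a toy framework. The sketch below is only an illustration under stated assumptions: the rule set `RULES` is finite and hypothetical, Lang is the union of the rules' sets, $\mathrm{Min}_X$ is found by brute force, and each node stores the formal system itself instead of a decision procedure for its language.

```python
from itertools import combinations

# Hypothetical toy rules: each rule contributes a finite set of objects.
RULES = {"r1": {1, 2}, "r2": {2, 3}, "r3": {4}}


def lang(system):
    covered = set()
    for rule in system:
        covered |= RULES[rule]
    return covered


def min_systems(x):
    """Brute-force Min_X for the toy rules, in a fixed order."""
    names = sorted(RULES)
    gen = [frozenset(c) for k in range(len(names) + 1)
           for c in combinations(names, k) if x <= lang(c)]
    return sorted((g for g in gen if not any(h < g for h in gen)), key=sorted)


class Node:
    def __init__(self, system, depth):
        self.system, self.depth, self.children = system, depth, []


def leaves(node):
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


def explore(text, m):
    """Build the search tree for a finite text prefix; '#' marks a pause."""
    root = Node(frozenset(), 0)
    seen = set()
    for item in text:
        if item != "#":
            seen.add(item)
        for leaf in list(leaves(root)):
            # Expand only leaves that are not too deep and fail to cover the data.
            if leaf.depth < m and not seen <= lang(leaf.system):
                for system in min_systems(seen):
                    leaf.children.append(Node(system, leaf.depth + 1))
    return root


def labels(node):
    """Preorder list of the systems stored in the tree."""
    result = [node.system]
    for child in node.children:
        result.extend(labels(child))
    return result
```

On the text prefix 1, #, 3 with m = 2, the tree grows from the empty system to {r1} and then to {r1, r2}; with m = 1 the second expansion is blocked by the depth bound.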
3.2 Learnability from positive and negative data

In this section we show that if a language defining framework has effective bounded
finite thickness then the class of languages defined by formal systems of cardinality
$\le m$ can be InfEx-identified with an ordinal mind change bound of $\omega \times m$.
This result is a generalization of a result about unions of pattern languages from
[AJS97]. We first introduce some technical machinery.

Let Pos $\subseteq U$ and Neg $\subseteq U$ be two disjoint finite sets such that Pos $\ne \emptyset$. Then let

\[ Z_i^{\mathrm{Pos},\mathrm{Neg}} \stackrel{\mathrm{def}}{=} \{\Gamma \subseteq R \mid \mathrm{card}(\Gamma) = i \wedge [\mathrm{Pos} \subseteq \mathrm{Lang}(\Gamma)] \wedge [\mathrm{Neg} \subseteq U - \mathrm{Lang}(\Gamma)]\}. \]

The next lemma and corollary shed light on computation of $Z_i^{\mathrm{Pos},\mathrm{Neg}}$.

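Lemma 2. Let $\langle U, R, \mathrm{Lang} \rangle$ be a language defining framework with effective
bounded finite thickness. Let Pos $\ne \emptyset$ and Neg be two disjoint finite subsets of $U$
and let $i \in N$. Suppose $(\forall j \le i)[Z_j^{\mathrm{Pos},\mathrm{Neg}} = \emptyset]$. Then $Z_{i+1}^{\mathrm{Pos},\mathrm{Neg}}$ can be computed
effectively in Pos, Neg, and $i$. (Note that $Z_{i+1}^{\mathrm{Pos},\mathrm{Neg}}$ must be finite in this case!)

Proof. Let Pos, Neg, and $i$ be as given in the hypothesis of the lemma. From
the effective bounded finite thickness property of $\langle U, R, \mathrm{Lang} \rangle$ it easily follows
that for each $x \in U$,

\[ H_x = \{\Gamma \subseteq R \mid x \in \mathrm{Lang}(\Gamma) \wedge \mathrm{card}(\Gamma) < \infty \wedge (\forall \Gamma' \subset \Gamma)[x \notin \mathrm{Lang}(\Gamma')]\} \]

is finite and effective in $x$. Let

\[ X \stackrel{\mathrm{def}}{=} \{\Gamma \subseteq R \mid \mathrm{card}(\Gamma) = i + 1 \wedge (\forall x \in \mathrm{Pos})(\exists \Gamma_x \in H_x)[\Gamma_x \subseteq \Gamma] \wedge (\mathrm{Neg} \cap \mathrm{Lang}(\Gamma) = \emptyset)\}. \]

It is easy to verify that $X = Z_{i+1}^{\mathrm{Pos},\mathrm{Neg}}$. Also, since $H_x$ is finite and effective in $x$
for each $x \in U$, $X$ is finite and can be obtained effectively from Pos, Neg and
$i$. □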



Corollary 1. Let Pos $\ne \emptyset$ and Neg be two disjoint finite subsets of $U$. Then there
exists an $i$, effective in Pos and Neg, such that $i = \min(\{j \mid Z_j^{\mathrm{Pos},\mathrm{Neg}} \ne \emptyset\})$.

Proof. Note that $\mathrm{Lang}(\emptyset)$ is empty. The corollary now follows by repeated use
of Lemma 2 until one finds an $i$ such that $Z_i^{\mathrm{Pos},\mathrm{Neg}} \ne \emptyset$. □



Theorem 2. Let $\langle U, R, \mathrm{Lang} \rangle$ be a language defining framework with effective
bounded finite thickness. For each $m > 0$, let

\[ \mathcal{L}^m = \{\mathrm{Lang}(\Gamma) \mid \Gamma \subseteq R \wedge \mathrm{card}(\Gamma) \le m\}. \]

Then $(\forall m > 0)[\mathcal{L}^m \in \mathbf{InfEx}_{\omega \times m}]$.

Proof. Fix $m$. Let $I$ be an informant. Then for $n \in N$, $M(I[n])$ and $F(I[n])$ are
defined as follows.

Let Pos $= \mathrm{PosInfo}(I[n])$ and Neg $= \mathrm{NegInfo}(I[n])$.

If Pos $= \emptyset$, then $M(I[n]) =\ ?$ and $F(I[n]) = \omega \times m$.

If Pos $\ne \emptyset$, then let $j = \min(\{j' \mid Z_{j'}^{\mathrm{Pos},\mathrm{Neg}} \ne \emptyset\})$. Note that $j$ (and the corresponding
$Z_j^{\mathrm{Pos},\mathrm{Neg}}$) can be found effectively in $I[n]$, using Corollary 1.

If $j > m$, then let $M(I[n]) = M(I[n-1])$.

If $j \le m$, then let $M(I[n]) = \Gamma$, where $\Gamma$ is the lexicographically least element in
$Z_j^{\mathrm{Pos},\mathrm{Neg}}$, and let $F(I[n]) = \omega \times k + i$, where $k = m - j$ and $i = \mathrm{card}(Z_j^{\mathrm{Pos},\mathrm{Neg}}) - 1$.

It is easy to verify that M, F witness the theorem. □
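One step of the learner in the proof can be sketched on a toy framework. The assumptions are specific to this illustration: `RULES` is a finite, hypothetical rule set with Lang as the union of the rules' sets, the counter $\omega \times k + i$ is encoded as the pair (k, i) and compared lexicographically, and in the case $j > m$ (or when no consistent system exists in the toy rule set) the sketch simply keeps the previous conjecture and counter, a choice the proof does not spell out for the counter.

```python
from itertools import combinations

# Hypothetical toy rules: each rule contributes a finite set of objects.
RULES = {"r1": {1}, "r2": {2}, "r3": {1, 3}}


def lang(system):
    covered = set()
    for rule in system:
        covered |= RULES[rule]
    return covered


def consistent_systems(pos, neg, size):
    """Systems of the given size that cover pos and avoid neg, in lexicographic order."""
    return [c for c in combinations(sorted(RULES), size)
            if pos <= lang(c) and not (neg & lang(c))]


def learner_step(pos, neg, m, previous):
    """Return (conjecture, counter); the counter (k, i) stands for omega*k + i."""
    if not pos:
        return "?", (m, 0)
    for j in range(1, len(RULES) + 1):
        z = consistent_systems(pos, neg, j)
        if z:
            break
    else:
        return previous  # toy rule set has no consistent system at all
    if j > m:
        return previous
    return z[0], (m - j, len(z) - 1)


state0 = learner_step(set(), set(), 2, None)
state1 = learner_step({1}, set(), 2, state0)
state2 = learner_step({1}, {3}, 2, state1)
state3 = learner_step({1, 2}, {3}, 2, state2)
```

In this run the conjecture changes once, from (r1,) to (r1, r2), and the counter drops from (1, 0) to (0, 0) at that step, while the earlier drop from (1, 1) to (1, 0) happens without a mind change.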



4 Classes of logic programs 



We next describe the preliminaries from logic programming; the reader is referred
to Lloyd [Llo87] for any unexplained notation.

Let $\Pi$, $\Sigma$, $X$ be mutually disjoint sets such that $\Pi$ and $\Sigma$ are finite. $\Pi$ is the
set of predicate symbols, $\Sigma$ is the set of function symbols, and $X$ is the set of
variables. The arity of a function or a predicate symbol $p$ is denoted arity($p$).
The set of terms constructed from the function symbols in $\Sigma$ and variables in $X$
is denoted Terms($\Sigma$, $X$). Atoms($\Pi$, $\Sigma$, $X$) denotes the set of atoms formed from
predicate symbols in $\Pi$ and terms in Terms($\Sigma$, $X$). The set of ground atoms for
a predicate symbol $p$, then, is Atoms($\{p\}$, $\Sigma$, $\emptyset$); we denote this set by $B(p)$. The
size of a term $t$, denoted $|t|$, is the number of symbols other than punctuation
symbols in $t$. The body length of a definite clause is the number of literals in its
body. The length of a logic program $P$, denoted Length($P$), is just the number
of clauses in $P$.
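The size measure $|t|$ can be computed by a short recursive function. The term encoding is an assumption of this sketch: a variable is a plain string, and a compound term or constant is a tuple whose first entry is the function symbol followed by its arguments.

```python
def term_size(term) -> int:
    """Count the symbols of a term, ignoring punctuation such as commas and parentheses."""
    if isinstance(term, str):
        return 1  # a variable counts as one symbol
    _functor, *args = term
    return 1 + sum(term_size(arg) for arg in args)


# f(X, g(Y, a)) in the assumed encoding; "a" is a constant, i.e. a 0-ary function symbol.
example_term = ("f", "X", ("g", "Y", ("a",)))
```

The example term has the five symbols f, X, g, Y and a, so its size is 5.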

Following the treatment of [KR96], we take the least Herbrand model semantics
of logic programs as our monotonic semantic mapping in the present paper.
We will refer to the target predicate being learned by the symbol $p$. Then our
language defining frameworks will be of the form $\langle B(p), LP, M_p \rangle$, where LP
is the class of Prolog clauses being considered and $M_p$ denotes the semantic
mapping such that $M_p(P)$ is the set of all atoms of the target predicate $p$ in the
least Herbrand model of $P$.

We next describe linear Prolog programs introduced by Shapiro [Sha81].

Definition 11. [Sha81] A definite clause $p(t_1, \ldots, t_n) \leftarrow q_1(s_{11}, \ldots, s_{1n_1}), \ldots, q_k(s_{k1}, \ldots, s_{kn_k})$
is called linear just in case for each $i$, $1 \le i \le k$,
$|t_1\sigma| + \cdots + |t_n\sigma| \ge |s_{i1}\sigma| + \cdots + |s_{in_i}\sigma|$ for any substitution $\sigma$. A logic program $P$ is said to
be linear just in case each clause in $P$ is linear.

Theorem 3. [Shi91] The class of least Herbrand models of linear Prolog programs
is a uniformly decidable family of computable languages.

Let LC denote the class of all linear clauses and $M_p$ be a semantic mapping such
that $M_p(P)$ is the set of all atoms of the target predicate $p$ in the least Herbrand
model of $P$. Then we have the following.

Theorem 4. The language defining framework $\langle B(p), LC, M_p \rangle$ has effective
bounded finite thickness.

Proof. Shinohara’s proof of $\langle B(p), LC, M_p \rangle$ having bounded finite thickness can
easily be modified to show that it is effective. □

We note that a similar result can be shown for the class of hereditary logic
programs [MSS91, MSS93] and reductive logic programs [KR96].

The above results were for classes of logic programs that did not allow local 
variables. We now turn our attention to the classes of logic programs that al- 
low local variables. We show that the language defining frameworks associated 
with the class of linearly-covering Prolog programs of Arimura and Shinohara
and the class of linearly-moded Prolog programs of Krishna Rao have effective
bounded finite thickness if the body length of the clauses is bounded. Since the
class of linearly-covering programs is subsumed by the class of linearly-moded
programs, we show the result for only the latter class. But, first we introduce 
some terminology about multisets, parametric size of terms, and moded logic 
programs. 

For a multiset $X$ and an object $o$, $\mathrm{Occ}(o, X)$ denotes the number of occurrences
of $o$ in $X$. The inclusion $\subseteq$ between multisets $X$ and $Y$ is defined as $X \subseteq Y$
just in case $\mathrm{Occ}(o, X) \le \mathrm{Occ}(o, Y)$ for any object $o$. The sum of two multisets
$X$ and $Y$, denoted $X + Y$, is defined as the multiset for which $\mathrm{Occ}(o, X + Y) =
\mathrm{Occ}(o, X) + \mathrm{Occ}(o, Y)$ for any object $o$. Let $\langle\,\rangle$ denote an empty list.
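These multiset operations map directly onto Python's `collections.Counter`. The helper names below are hypothetical and chosen to mirror the notation in the text.

```python
from collections import Counter


def occ(obj, multiset: Counter) -> int:
    """Number of occurrences of obj in the multiset (0 if absent)."""
    return multiset[obj]


def included(x: Counter, y: Counter) -> bool:
    """Multiset inclusion: every object occurs in y at least as often as in x."""
    return all(y[obj] >= count for obj, count in x.items())


def multiset_sum(x: Counter, y: Counter) -> Counter:
    """Multiset sum: occurrence counts are added object by object."""
    return x + y


X = Counter("aab")    # a occurs twice, b once
Y = Counter("aaabc")  # a three times, b once, c once
```

With these two multisets, X is included in Y but not conversely, and a occurs five times in their sum.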

Definition 12. The parametric size of a term $t$, denoted $Psize(t)$, is defined
inductively as follows:

(a) if $t$ is a variable $x$ then $Psize(t)$ is the linear expression $x$;

(b) if $t$ is the empty list, then $Psize(t)$ is 0;

(c) if $t = f(t_1, \ldots, t_n)$ and $f \in \Sigma - \{\langle\rangle\}$, then $Psize(t)$ is the linear expression
$1 + Psize(t_1) + \cdots + Psize(t_n)$.

We usually denote a sequence of terms $t_1, \ldots, t_n$ by $\mathbf{t}$. The parametric size of a
sequence of terms $t_1, \ldots, t_n$ is the sum $Psize(t_1) + \cdots + Psize(t_n)$.
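As an illustration of Definition 12, the following Python sketch computes the parametric size as a pair (constant, variable coefficients). The term encoding is an assumption made here for the example: variables are strings, the empty list is the tuple `EMPTY`, and a compound term is a tuple whose first entry is the function symbol.

```python
from collections import Counter

EMPTY = ("[]",)  # assumed encoding of the empty list

def psize(t):
    """Return (constant, Counter of variable coefficients) representing Psize(t)."""
    if isinstance(t, str):           # case (a): a variable x has size x
        return 0, Counter({t: 1})
    if t == EMPTY:                   # case (b): the empty list has size 0
        return 0, Counter()
    const, coeffs = 1, Counter()     # case (c): 1 + sizes of the arguments
    for arg in t[1:]:
        c, v = psize(arg)
        const += c
        coeffs += v
    return const, coeffs

def cons(head, tail):
    """Hypothetical list constructor used only for this example."""
    return ("cons", head, tail)

t = cons("x", cons("y", EMPTY))      # the list [x, y]
```

For the list `[x, y]` this yields the linear expression 2 + x + y.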

The definition of linearly-covering programs requires the notion of modes asso- 
ciated with each argument in a predicate. 

Definition 13. (a) A mode declaration for an n-ary predicate $p$ is a mapping
from $\{1, \ldots, n\}$ to the set $\{+, -\}$. (b) Let $md$ be a mode declaration for the
predicate $p$. Then the sets $+(p) = \{j \mid md(j) = +\}$ and $-(p) = \{j \mid md(j) = -\}$
are the sets of input and output positions of $p$, respectively.

If each predicate in a logic program has a unique mode declaration, the program 
is referred to as a moded program. In dealing with moded programs, it is useful 
to group together the input and output arguments, i.e., p(s;t) is an atom with 
input terms s and output terms t. 

The definition of linearly-moded logic programs requires the following technical 
notion. 

Definition 14. [KR96] Let $P$ be a moded logic program and let $I$ be a mapping
from the set of predicates occurring in $P$ to sets of input positions such that
$I(p) \subseteq +(p)$ for each predicate $p$ in $P$. Then for an atom $A = p(\mathbf{s}; \mathbf{t})$, the
following linear inequality is denoted $LI(A, I)$.

$$\sum_{i \in I(p)} Psize(s_i) \geq \sum_{j \in -(p)} Psize(t_j).$$



We now define Krishna Rao's notion of what it means for a logic program to be
linearly-moded.

Definition 15. [KR96]

(a) Let $P$ be a moded logic program and let $I$ be a mapping from the set of
predicates in $P$ to the sets of input positions satisfying $I(p) \subseteq +(p)$ for each
predicate $p$ in $P$. $P$ is said to be linearly-moded with respect to $I$ if each
clause

$$p_0(\mathbf{s}_0; \mathbf{t}_0) \leftarrow p_1(\mathbf{s}_1; \mathbf{t}_1), \ldots, p_k(\mathbf{s}_k; \mathbf{t}_k)$$

in $P$ satisfies the following two conditions:

(i) $LI(A_1, I), \ldots, LI(A_{j-1}, I)$ together imply $Psize(\mathbf{s}_0) \geq Psize(\mathbf{s}_j)$ for each
$j \geq 1$, and

(ii) $LI(A_1, I), \ldots, LI(A_k, I)$ together imply $LI(A_0, I)$,
where $A_j$ is the atom $p_j(\mathbf{s}_j; \mathbf{t}_j)$ for each $j \geq 0$.

(b) A logic program $P$ is said to be linearly-moded just in case it is
linearly-moded with respect to some mapping $I$.

We now introduce the language defining framework of linearly-moded clauses.
For $k > 0$, let $LMC_k$ denote the set of all linearly-moded clauses of body length
at most $k$. Then the language defining framework associated with linearly-moded
clauses is $(B(p), LMC_k, M_p)$.

Theorem 5. [KR96] For $k \geq 1$, the class of least Herbrand models of logic
programs with clauses in $LMC_k$ is an indexed family of recursive languages.

Theorem 6. For $k \geq 1$, the language defining framework $(B(p), LMC_k, M_p)$ has
effective bounded finite thickness.

Proof. Krishna Rao's proof of $(B(p), LMC_k, M_p)$ having bounded finite
thickness can easily be made effective.

As a consequence of the above theorem and the results in Section 0, for each
$m \geq 1$ and each $k \geq 1$, the class of languages $\{M_p(P) \mid P \in LMC_k \wedge
Length(P) \leq m\}$ is a member of $TxtEx_{\omega^m}$ and of $InfEx_{\omega \times m}$. The reader should
note that the bound $k$ on the body length of clauses is crucial for the effective
bounded finite thickness property. It can be shown that without such a restriction
the class of least Herbrand models of length-bounded linearly-moded programs
contains a superfinite subclass, thereby ruling out its learnability from positive
data. Krishna Rao [KR96] has shown that both the class of linear clauses
and the class of linearly-covering clauses are included in the class of
linearly-moded clauses, but the classes of linear clauses and linearly-covering
clauses are incomparable to each other.



5 Conclusion 

A natural question is whether the bounds of $\omega^m$ and $\omega \times m$ are tight. It can be
shown for the example classes in this paper that for identification from positive
data, the ordinal bound of $\omega^m$ is tight. For identification from both positive and
negative data, it is still open if the bound of $\omega \times m$ is tight. However, we can
show an improvement on the bound $\omega \times m$ under certain conditions if a restricted
version of the language equivalence problem is decidable. In particular we can
show that if for some fixed $k \leq m$, the equivalence of $Lang(\Gamma)$ and $Lang(\Gamma')$
is decidable for $card(\Gamma) = card(\Gamma') \leq k$, then S

Abstract. Many settings of unsupervised learning can be viewed as
quantization problems — the minimization of the expected quantization 
error subject to some restrictions. This allows the use of tools such as 
regularization from the theory of (supervised) risk minimization for un- 
supervised settings. Moreover, this setting is very closely related to both 
principal curves and the generative topographic map. 

We explore this connection in two ways: 1) We propose an algorithm for
finding principal manifolds that can be regularized in a variety of ways.
Experimental results demonstrate the feasibility of the approach. 2) We
derive uniform convergence bounds and hence bounds on the learning
rates of the algorithm. In particular, we give good bounds on the covering
numbers which allow us to obtain a nearly optimal learning rate of order
$O(m^{-\frac{1}{2}+\alpha})$ for certain types of regularization operators, where $m$ is the
sample size and $\alpha$ an arbitrary positive constant.



1 Introduction 

The problems of unsupervised learning are much less precisely defined than
those of supervised learning. Usually no explicit cost function exists by which
the hypothesis can be compared with training data. Instead, one has to make
assumptions on the data, with respect to which questions may be asked.

A possible goal would be to look for reliable feature extractors, a setting that
can be shown to lead to Kernel Principal Component Analysis [?]. Another
option is to look for properties that represent the data best. This leads
to a descriptive model of the data (and possibly also a quite crude model of
the underlying probability distribution). Principal Curves [?], the Generative
Topographic Mapping [2], several linear Gaussian models, or also simple vector
quantizers [?] are examples thereof.

We will study this type of model in the present paper. As many problems of
unsupervised learning can be formalized in a quantization functional setting,
this will allow us to use techniques from regularization theory. In particular this
leads to a natural generalization (to higher dimensionality and different criteria
of regularity) of the principal curves algorithm with a length constraint [7]. See
also [?] for an overview and background on principal curves. Experimental results
demonstrate the feasibility of this approach.

In the second part we use the quantization functional approach to give uniform
convergence bounds. In particular we derive a bound on the covering/entropy
number, using functional analytic tools with respect to the $L_\infty(\ell_2^d)$ metric. This
allows us to give a bound on the rate of convergence by $O(m^{-\frac{1}{2}+\alpha})$ for arbitrary
positive $\alpha$ where $m$ is the number of examples seen. For specific kernels this
improves on the rate in [7] which is $O(m^{-1/3})$. Curiously, using our approach
and a regularization operator equivalent to that implicitly used in [7] results
in a weaker bound of $O(m^{-1/4})$. We suggest a possible reason for this in the
penultimate section of the paper.

2 The Quantization Error Functional 

Denote by $\mathcal{X}$ a vector space and $X := \{x_1, \ldots, x_m\} \subset \mathcal{X}$ a dataset drawn iid
from an underlying probability distribution $\mu(x)$. Moreover consider index sets
$\mathcal{Z}$, maps $f: \mathcal{Z} \to \mathcal{X}$, and classes $\mathcal{F}$ of such maps (with $f \in \mathcal{F}$).

Here the map $f$ is supposed to describe some basic properties of $\mu(x)$. In
particular one seeks such $f$ that the so-called quantization error

$$R[f] := \int_{\mathcal{X}} \min_{z \in \mathcal{Z}} \|x - f(z)\|^2 \, d\mu(x) \tag{1}$$

is minimized. Unfortunately, this is unsolvable, as $\mu$ is generally unknown. Hence
one replaces $\mu$ by the empirical density $\mu_m(x) := \frac{1}{m} \sum_{i=1}^m \delta(x - x_i)$ and instead
of (1) analyzes the empirical quantization error

$$R_{emp}[f] := \frac{1}{m} \sum_{i=1}^m \min_{z \in \mathcal{Z}} \|x_i - f(z)\|^2. \tag{2}$$
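For a finite index set the empirical quantization error (2) can be evaluated directly. The Python sketch below is only illustrative; the function names and the toy data are assumptions made for this example, not part of the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def empirical_quantization_error(X, f, Z):
    """R_emp[f] = (1/m) * sum_i min_{z in Z} ||x_i - f(z)||^2 for a finite index set Z."""
    return sum(min(sq_dist(x, f(z)) for z in Z) for x in X) / len(X)

X = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]

# A single codebook vector placed at the sample mean.
mean = tuple(sum(c) / len(X) for c in zip(*X))
r_mean = empirical_quantization_error(X, lambda z: mean, [1])

# Two codebook vectors (a hypothetical vector quantizer).
codebook = {1: (0.0, 0.0), 2: (3.0, 0.0)}
r_vq = empirical_quantization_error(X, codebook.get, [1, 2])
```

With one vector at the mean the error equals the empirical variance 8/3; the two-vector codebook lowers it to 2/3.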

Many problems of unsupervised learning can be cast in the form of finding a
minimizer of (1) or (2). Let us consider some practical examples.

Example 1 (Sample Mean). Define $\mathcal{Z} := \{1\}$, $f: 1 \mapsto f_1$ with $f_1 \in \mathcal{X}$, and $\mathcal{F}$ to
be the set of all such functions. Then the minimum of

$$R[f] := \int_{\mathcal{X}} \|x - f_1\|^2 \, d\mu(x) \tag{3}$$

denotes the variance of the data and the minimizers of the quantization
functionals can be determined analytically by

$$\operatorname{argmin}_{f \in \mathcal{F}} R[f] = \int_{\mathcal{X}} x \, d\mu(x) \quad \text{and} \quad \operatorname{argmin}_{f \in \mathcal{F}} R_{emp}[f] = \frac{1}{m} \sum_{i=1}^m x_i. \tag{4}$$

This is the (empirical) sample mean. Via the law of large numbers $R_{emp}[f]$ and
its minimizer converge to $R[f]$ and the corresponding minimizer (which is already
a uniform convergence statement).



Example 2 (k-Vectors Quantization). Define $\mathcal{Z} := \{1, \ldots, k\}$, $f: i \mapsto f_i$ with
$f_i \in \mathcal{X}$, and $\mathcal{F}$ to be the set of all such functions. Then

$$R[f] := \int_{\mathcal{X}} \min_{z \in \{1, \ldots, k\}} \|x - f_z\|^2 \, d\mu(x) \tag{5}$$

denotes the canonical distortion error of a vector quantizer. In practice one uses
the $k$-means algorithm to find a set of vectors $\{f_1, \ldots, f_k\}$ minimizing $R_{emp}[f]$.
Also here one can prove convergence properties of (the minimizer of) $R_{emp}[f]$
to (the one of) $R[f]$.
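The $k$-means procedure mentioned in Example 2 can be sketched as the usual alternation between assigning points to their nearest codebook vector and re-estimating each vector as a cluster mean (Lloyd's iteration). The code below is a minimal illustration with assumed names and toy data, not the authors' implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(X, centers, iters=20):
    """Lloyd's iteration: assign each point to its nearest center, then recompute means."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in X:
            j = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            clusters[j].append(x)
        # An empty cluster keeps its previous center.
        centers = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centers[i]
                   for i, pts in enumerate(clusters)]
    return centers

X = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers = kmeans(X, centers=[(0.0,), (1.0,)])
```

On this toy data the two vectors settle at 0.5 and 9.5, the means of the two visible groups.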



Instead of discrete quantization one can also consider a mapping onto a manifold
of lower dimensionality than the input space. PCA can also be viewed in this
way [?]:

Example 3 (Principal Components). Define $\mathcal{Z} := \mathbb{R}$, $f: z \mapsto f_0 + z \cdot f_1$ with
$f_0, f_1 \in \mathcal{X}$, $\|f_1\| = 1$, and $\mathcal{F}$ to be the set of all such line segments. Then the
minimizer of

$$R[f] := \int_{\mathcal{X}} \min_{z \in [0,1]} \|x - f_0 - z \cdot f_1\|^2 \, d\mu(x) \tag{6}$$

yields a line parallel to the direction of largest variance in $\mu(x)$ [?].



Based on the properties of the current example, Hastie & Stuetzle [?] carried
this idea further by also allowing other than linear functions $f(z)$.



Example 4 (Principal Curves and Surfaces). Denote $\mathcal{Z} := [0,1]^D$ (with $D > 1$
for principal surfaces), $f: z \mapsto f(z)$ with $f \in \mathcal{F}$, where $\mathcal{F}$ is a class of continuous
$\mathbb{R}^d$-valued functions (possibly with further restrictions). The minimizer
of

$$R[f] := \int_{\mathcal{X}} \min_{z \in \mathcal{Z}} \|x - f(z)\|^2 \, d\mu(x) \tag{7}$$

is not well defined, unless $\mathcal{F}$ is a compact set. Moreover, even the minimizer of
$R_{emp}[f]$ is not well defined either, in general. In fact, it is an ill-posed problem
in the sense of Arsenin and Tikhonov [?]. Until recently [?], no convergence
properties of $R_{emp}[f]$ to $R[f]$ could be stated.



Kegl et al. [7] modified the original "principal-curves" algorithm, in order to
prove bounds on $R[f]$ in terms of $R_{emp}[f]$ and to show that the resulting estimate
is well defined. The changes imply a restriction of $\mathcal{F}$ to polygonal lines with a
fixed number of knots and, most importantly, fixed length $L$.^1 Instead of the
latter we now consider smoothness constraints on the estimated curve $f$. This
is done via a regularization operator. As well as allowing greater freedom in the
choice of regularizer (which, as we show, can lead to faster convergence), the
regularization operator framework can be extended to situations where $D > 1$.
Thus we can provide theoretical insight into principal manifold algorithms.

^1 In practice Kegl et al. use a constraint on the angles of a polygonal curve rather
than the actual length constraint to achieve sample complexity rates on the training
time. For the uniform convergence part, however, the length constraint is used.

3 Invariant Regularizers 

As a first step we will show that the class of admissible operators can be
restricted to scalar ones, provided some basic assumptions about scaling behavior
and permutation symmetry are imposed.

Proposition 1 (Homogeneous Invariant Regularization). Any regularizer
$Q[f]$ that is both homogeneous quadratic and invariant under an irreducible
orthogonal representation $\rho$ of the group $\mathcal{G}$ on $\mathcal{X}$, i.e. satisfies

$$Q[f] \geq 0 \text{ for all } f \in \mathcal{F} \tag{8}$$

$$Q[af] = a^2 Q[f] \text{ for all scalars } a \tag{9}$$

$$Q[\rho(g) f] = Q[f] \text{ for all } \rho(g) \in \mathcal{G} \tag{10}$$

is of the form $Q[f] = \langle Pf, Pf \rangle$ where $P$ is a "scalar" operator.

Proof. It follows directly from (9) and Euler's "homogeneity property" that
$Q[f]$ is a quadratic form, thus $Q[f] = \langle f, Mf \rangle$ for some operator $M$. Moreover
$M$ can be written as $P^*P$ as it has to be positive (cf. (8)).

Finally from $\langle Pf, Pf \rangle = \langle P\rho(g)f, P\rho(g)f \rangle$ and the polarization equation it
follows that $P^*P\rho(g) = \rho(g)P^*P$ has to hold for any $\rho(g) \in \mathcal{G}$. Thus, by virtue of
Schur's lemma (cf. e.g. [?]) $P^*P$ may only be a scalar operator. Without loss of
generality, also $P$ may be assumed to be scalar.

A consequence is that there exists no "vector valued" regularization operator
satisfying the invariance conditions. Hence it is useless to look for other operators
$P$ in the presence of a sufficiently strong invariance.

Under the assumptions of Proposition 1 both the canonical representation of
the permutation group in a finite dimensional vector space $\mathcal{X}$ and the group
of orthogonal transformations on $\mathcal{X}$ enforce scalar operators $P$. This follows
immediately from the fact that these groups are unitary and irreducible on $\mathcal{X}$
by construction. Thus in the following we will only consider scalar operators $P$.

4 A Regularized Quantization Functional

We now propose a variant to minimizing the empirical quantization functional
which leads to an algorithm that is more amenable to implementation. Moreover,
uniform convergence bounds can be obtained for the classes of smooth curves
induced by this approach. For this purpose, a regularized version of the empirical
quantization functional is needed.

$$R_{reg}[f] := R_{emp}[f] + \frac{\lambda}{2} \|Pf\|^2 = \frac{1}{m} \sum_{i=1}^m \min_{z \in \mathcal{Z}} \|x_i - f(z)\|^2 + \frac{\lambda}{2} \|Pf\|^2. \tag{11}$$

Here $P$ is a scalar regularization operator in the sense of Arsenin and Tikhonov,
penalizing unsmooth functions $f$ (see [?] for details). (See some examples in
section 4.2.) For the sake of finding principal manifolds, utilizing a scalar
regularization operator simply means all curves or surfaces which can be transformed
into each other by rotations should be penalized equally.

Using the results of [?] regarding the connection between regularization operators
and kernels it appears suitable to choose a kernel expansion of $f$ matching the
regularization operator $P$, i.e. $\langle Pk(x_i, \cdot), Pk(x_j, \cdot) \rangle = k(x_i, x_j)$. Finally assume
$P^*P f_0 = 0$, i.e. constant functions are not regularized. For an expansion like

$$f(z) = f_0 + \sum_{i=1}^M \alpha_i k(z_i, z) \quad \text{with } z_i \in \mathcal{Z},\ \alpha_i \in \mathcal{X}, \text{ and } k: \mathcal{Z}^2 \to \mathbb{R} \tag{12}$$

with some previously chosen nodes $z_1, \ldots, z_M$ (of which one takes as many as
one may afford in terms of computational cost) the regularization term can be
written as

$$\|Pf\|^2 = \sum_{i,j=1}^M \langle \alpha_i, \alpha_j \rangle k(z_i, z_j). \tag{13}$$
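To make (12) and (13) concrete, the sketch below evaluates a kernel expansion with $f_0 = 0$ and its regularization term. The Gaussian kernel on a one-dimensional latent space, the node positions and the coefficients are assumptions chosen only for illustration.

```python
import math

def k(a, b, width=1.0):
    """Gaussian RBF kernel on a one-dimensional latent space (an assumed choice)."""
    return math.exp(-((a - b) ** 2) / (2.0 * width ** 2))

def f(z, nodes, alpha):
    """Kernel expansion (12) with f_0 = 0: f(z) = sum_i alpha_i k(z_i, z)."""
    d = len(alpha[0])
    return [sum(a[c] * k(zi, z) for zi, a in zip(nodes, alpha)) for c in range(d)]

def reg_term(nodes, alpha):
    """Regularization term (13): sum_{i,j} <alpha_i, alpha_j> k(z_i, z_j)."""
    return sum(
        sum(p * q for p, q in zip(ai, aj)) * k(zi, zj)
        for zi, ai in zip(nodes, alpha)
        for zj, aj in zip(nodes, alpha)
    )

nodes = [-1.0, 0.0, 1.0]
alpha = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
```

For these values the two outer coefficients cancel at $z = 0$, so `f(0.0, nodes, alpha)` is `[0.0, 2.0]`, and the regularization term evaluates to $6 - 2e^{-2}$.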

What remains is to find an algorithm that minimizes $R_{reg}$. This is achieved by
coordinate descent. In the following we will assume the data to be centered and
therefore drop the term $f_0$. This greatly simplifies the notation.



4.1 An Algorithm for Minimizing $R_{reg}[f]$



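Minimizing the regularized quantization functional for a given kernel expansion
is equivalent to solving

$$\min_{\substack{\{\zeta_1, \ldots, \zeta_m\} \subset \mathcal{Z} \\ \{\alpha_1, \ldots, \alpha_M\} \subset \mathcal{X}}} \left[ \frac{1}{m} \sum_{i=1}^m \Big\| x_i - \sum_{j=1}^M \alpha_j k(\zeta_i, z_j) \Big\|^2 + \frac{\lambda}{2} \sum_{i,j=1}^M \langle \alpha_i, \alpha_j \rangle k(z_i, z_j) \right] \tag{14}$$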



This is achieved in an iterative fashion analogously to how the EM algorithm
operates. One iterates over minimizing (14) with respect to $\{\zeta_1, \ldots, \zeta_m\}$,
equivalent to the projection step, and $\{\alpha_1, \ldots, \alpha_M\}$, which corresponds to the
expectation step. This is repeated until convergence, in practice, until the regularized
quantization functional does not decrease significantly any further. One obtains:

Projection: For each $i \in \{1, \ldots, m\}$ choose $\zeta_i := \operatorname{argmin}_{\zeta \in \mathcal{Z}} \|f(\zeta) - x_i\|^2$.
Clearly, for fixed $\alpha_i$, the so chosen $\zeta_i$ minimizes the term in (14), which in turn is
equal to $R_{reg}[f]$ for given $\alpha_i$ and $X$.

Adaptation: Now the parameters $\zeta_i$ are fixed and $\alpha_i$ is adapted such that $R_{reg}[f]$
decreases further. For fixed $\zeta_i$ differentiation of (14) with respect to $\alpha_i$ yields

$$\left( \frac{\lambda m}{2} K_z + K_\zeta^\top K_\zeta \right) \alpha = K_\zeta^\top X \tag{15}$$



where $(K_z)_{ij} := k(z_i, z_j)$ is an $M \times M$ matrix and $(K_\zeta)_{ij} := k(\zeta_i, z_j)$ is $m \times M$.
Moreover, with slight abuse of notation, $\alpha$ and $X$ denote the matrix of all
parameters and samples, respectively. The term in (14) keeps on decreasing
until the algorithm converges to a (local) minimum. What remains is to find
good starting values.

Initialization: If not dealing, as assumed, with centered data, set $f_0$ to the
sample mean, i.e. $f_0 = \frac{1}{m} \sum_{i=1}^m x_i$. Moreover, choose the
coefficients $\alpha_i$ such that $f$ approximately points into the directions of the first
$D$ principal components given by the matrix $V := (v_1, \ldots, v_D)$. This is done as
follows, analogously to the initialization in the generative topographic map [2,
eq. (2.20)]:

$$\min_{\{\alpha_1, \ldots, \alpha_M\} \subset \mathcal{X}} \sum_{i=1}^M \Big\| V(z_i - z_0) - \sum_{j=1}^M \alpha_j k(z_i, z_j) \Big\|^2 + \frac{\lambda}{2} \sum_{i,j=1}^M \langle \alpha_i, \alpha_j \rangle k(z_i, z_j) \tag{16}$$

Thus $\alpha$ is given by the solution of $\left( \frac{\lambda}{2} \mathbf{1} + K_z \right) \alpha = V(Z - Z_0)$ where $Z$ denotes
the matrix of $z_i$, $z_0$ the mean of $z_i$, and $Z_0$ the matrix of $z_0$ correspondingly.
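The alternation of projection and adaptation steps can be sketched as follows. This is a simplified illustration under several assumptions that are ours rather than the paper's: the latent space is one-dimensional, the projection step searches a finite grid instead of all of $\mathcal{Z}$, the kernel is Gaussian, the empirical error carries the factor $1/m$ as in (2), and a crude hand-made starting value replaces the PCA-based initialization.

```python
import math

def k(a, b, width=1.0):
    """Gaussian RBF kernel on a one-dimensional latent space (assumed)."""
    return math.exp(-((a - b) ** 2) / (2.0 * width ** 2))

def solve(A, B):
    """Solve A X = B (B may have several columns) by Gauss-Jordan elimination."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [a - factor * b for a, b in zip(M[r], M[c])]
    return [[M[i][n + j] / M[i][i] for j in range(len(B[0]))] for i in range(n)]

def f(z, nodes, alpha):
    """Kernel expansion with f_0 = 0 (centered data)."""
    return [sum(a[c] * k(zi, z) for zi, a in zip(nodes, alpha)) for c in range(len(alpha[0]))]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def r_reg(X, nodes, alpha, grid, lam):
    """Regularized quantization functional, with the minimum over z taken on the grid."""
    emp = sum(min(sq_dist(x, f(z, nodes, alpha)) for z in grid) for x in X) / len(X)
    reg = sum(sum(p * q for p, q in zip(ai, aj)) * k(zi, zj)
              for zi, ai in zip(nodes, alpha) for zj, aj in zip(nodes, alpha))
    return emp + 0.5 * lam * reg

def fit(X, nodes, grid, alpha, lam=0.1, iters=10):
    m, M = len(X), len(nodes)
    Kz = [[k(zi, zj) for zj in nodes] for zi in nodes]
    history = [r_reg(X, nodes, alpha, grid, lam)]
    for _ in range(iters):
        # Projection: zeta_i = argmin over the grid of ||f(zeta) - x_i||^2.
        curve = [f(z, nodes, alpha) for z in grid]
        zeta = [grid[min(range(len(grid)), key=lambda g: sq_dist(x, curve[g]))] for x in X]
        # Adaptation: solve (lam*m/2 * Kz + Kzeta^T Kzeta) alpha = Kzeta^T X.
        Kzeta = [[k(zi, zj) for zj in nodes] for zi in zeta]
        A = [[0.5 * lam * m * Kz[a][b] + sum(Kzeta[i][a] * Kzeta[i][b] for i in range(m))
              for b in range(M)] for a in range(M)]
        B = [[sum(Kzeta[i][a] * X[i][c] for i in range(m)) for c in range(len(X[0]))]
             for a in range(M)]
        alpha = solve(A, B)
        history.append(r_reg(X, nodes, alpha, grid, lam))
    return alpha, history

X = [(t / 10.0, (t / 10.0) ** 2) for t in range(-10, 11)]  # toy data on a parabola
nodes = [-1.0, -0.5, 0.0, 0.5, 1.0]
grid = [g / 20.0 for g in range(-20, 21)]
alpha0 = [[z, 0.0] for z in nodes]  # crude start roughly along the first axis
alpha, history = fit(X, nodes, grid, alpha0)
```

Because each step minimizes the same objective over one block of variables, the recorded values in `history` should not increase from one iteration to the next, which mirrors the behavior reported for the regularized quantization error in the experiments.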



The derivation of this algorithm was quite ad hoc; however, there are similar
precursors in the literature. One example is principal curves with a length
constraint. We will show below that for a particular choice of a regularizer,
minimizing (11) is equivalent to the latter.



4.2 Examples of Regularizers 

By choosing $P := \partial_z$, i.e. the differentiation operator, $\|Pf\|^2$ becomes an integral
over the squared "speed" of the curve. Reparameterizing $f$ to constant speed
leaves the empirical quantization error unchanged, whereas the regularization
term is minimized. This can be seen as follows: by construction $\int_{[0,1]} \|\partial_z f(z)\| \, dz$
does not depend on the (re)parameterization. The variance, however, is minimal
for a constant function, hence $\|\partial_z f(z)\|$ has to be constant over the interval $[0,1]$.
Thus $\|Pf\|^2$ equals the squared length of the curve at the optimal solution.

One can show that minimizing the empirical quantization error plus a regularizer
is equivalent to minimizing the empirical quantization error for a fixed value of
the regularization term (for $\lambda$ adjusted suitably). Hence the proposed algorithm
is equivalent to finding the optimal curve with a length constraint, i.e. it is
equivalent to the algorithm proposed by [7].^2

^2 The reasoning is slightly incorrect: $f$ cannot be completely reparameterized to
constant speed, as it is an expansion in terms of a finite number of nodes. However
the basic properties still hold, provided the number of kernels is sufficiently high.
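The effect of the parameterization on the regularization term can be checked numerically. The sketch below compares two parameterizations of the same unit-length segment using finite differences; the two curves and the function name are assumptions chosen for illustration.

```python
def energy_and_length(f, n=10000):
    """Finite-difference estimates of int ||f'(z)||^2 dz and int ||f'(z)|| dz over [0, 1]."""
    h = 1.0 / n
    energy = length = 0.0
    for i in range(n):
        a, b = f(i * h), f((i + 1) * h)
        speed = sum((q - p) ** 2 for p, q in zip(a, b)) ** 0.5 / h
        energy += speed ** 2 * h
        length += speed * h
    return energy, length

# Same segment from (0, 0) to (1, 0), traversed at constant and at varying speed.
e_const, len_const = energy_and_length(lambda z: (z, 0.0))
e_quad, len_quad = energy_and_length(lambda z: (z * z, 0.0))
```

Both curves have length 1, but the squared-speed integral is 1 (the squared length) for the constant-speed version and about 4/3 for the other, in line with the argument above.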

In the experiments we chose a Gaussian RBF kernel $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$.
This corresponds to a regularizer penalizing all orders of derivatives
simultaneously. In particular [14] show that this kernel corresponds to the
pseudodifferential operator defined by

$$\|Pf\|^2 = \int dx \sum_{n=0}^{\infty} \frac{\sigma^{2n}}{n! \, 2^n} \left( \hat{O}^n f(x) \right)^2 \tag{17}$$

with $\hat{O}^{2n} = \Delta^n$ and $\hat{O}^{2n+1} = \nabla \Delta^n$, $\Delta$ being the Laplacian and $\nabla$ the gradient
operator. This means that one is looking not only for smooth functions but also
curves whose curvature and other higher-order properties change very slowly.
For more details on regularization operators see e.g. [?].

4.3 The Connection to the GTM 

Just considering the basic algorithm of the GTM (without the Bayesian
framework), one can observe that it minimizes a rather similar quantity to $R_{reg}[f]$.
It differs in its choice of $\mathcal{Z}$, which is chosen to be a grid, identical with the
points $z_i$ in our setting, and the different regularizer (called Gaussian prior in
that case) which is of $\ell_2$ type. In other words instead of using $\|Pf\|^2$ Bishop et
al. [2] choose $\sum_i \|\alpha_i\|^2$ as a regularizer. Finally in the GTM several $\zeta_i$ may take
on "responsibility" for having generated a data-point $x_i$ (this follows naturally
from the generative model setting in the latter case).

Note that unlike in the GTM (cf. [2, sec. 2.3]) the number of nodes (for the
kernel expansion) is not a critical parameter. This is due to the fact that there
is a coupling between the single centers of the basis functions $k(z_i, z_j)$ via the
regularization operator. If needed, one could also see the proposed algorithm in
a Gaussian Process context (see [?]); the data $X$ then should be interpreted
as created by a homogeneous process mapping from $\mathcal{Z}$ to $\mathcal{X}$. Finally the use of
periodical kernels (cf. [?]) allows one to model circular structures in $\mathcal{X}$.

5 Experiments 

In order to show that the basic idea of the proposed algorithm is sound, we
ran several toy experiments (cf. figure 1). In all cases Gaussian rbf kernels,
as discussed in section 4.2, were used. We generated different data sets in 2
and 3 dimensions from 1 or 2 dimensional parameterizations. Then we applied
our algorithm using the prior knowledge about the original parameterization
dimension of the data set in choosing the latent variable space to have the
appropriate size. For almost any parameter setting ($\lambda$, $M$, and width of basis
functions) we obtained reasonable results.

We found that for a suitable choice of the regularization factor $\lambda$ a very close
match to the original distribution can be achieved. The number and width of the
basis functions had of course an effect on the solution, too. But their influence
on the basic characteristics is quite small.

Fig. 1. Upper 4 images. We generated a dataset (small dots) by adding noise to
a distribution indicated by the dotted line. The resulting manifold generated by
our approach is given by the solid line (over a parameter range of $\mathcal{Z} = [-1, 1]$).
From left to right we used different values for the regularization parameter $\lambda =
0.1, 0.5, 1, 4$. The width and number of basis functions were constant (1 and 10,
respectively). Lower 4 images. Here we generated a dataset by sampling (with
noise) from a distribution depicted in the leftmost image (small dots are the
sampled data). The remaining three images show the manifold yielded by our
approach over the parameter space $\mathcal{Z} = [-1, 1]^2$ for $\lambda = 0.001, 0.1, 1$. The width
and number of basis functions were constant (1 and 36).



Finally, figure 2 shows the convergence properties of the algorithm. One can
clearly observe that the overall regularized quantization error decreases for each
step, while both the regularization term and the quantization error term are free
to vary. This experimentally shows that the algorithm finds a (local) minimum
of $R_{reg}[f]$.

Fig. 2. Left: regularization term, middle: empirical quantization error, right:
regularized quantization error vs. number of iterations.

6 Uniform Convergence Bounds

We now proceed to an analysis of the rate of convergence of the above
algorithm. To avoid several technicalities (like boundedness of some moments of the
distribution $\mu(x)$ [?]) we will assume that there exists a ball of radius $r$ such
that $\Pr\{\|x\| \leq r\} = 1$ for all $x$. Kegl et al. showed that under these
assumptions also the principal manifold $f$ is contained in the ball $U_r$ of radius $r$, hence
the quantization error will be no larger than $(2r)^2$ for all $x$. In order to derive
uniform convergence bounds let us introduce the $L_\infty(\ell_2^d)$ norm on $\mathcal{F}$ (assumed
continuous)

$$\|f\|_{L_\infty(\ell_2^d)} := \sup_{z \in \mathcal{Z}} \|f(z)\|_{\ell_2^d} \tag{18}$$

where $\|\cdot\|_{\ell_2^d}$ denotes the Euclidean norm in $d$ dimensions. The metric is
induced by the norm in the usual fashion. Given a metric $\rho$ and a set $\mathcal{F}$, the
$\varepsilon$ covering number of $\mathcal{F}$, denoted $\mathcal{N}(\varepsilon, \mathcal{F}, \rho)$ (also $\mathcal{N}_\varepsilon$ when the dependency is
obvious), is the smallest number of $\rho$-balls of radius $\varepsilon$ the union of which contains
$\mathcal{F}$.
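For a finite class of curves sampled on a common grid, an upper bound on the covering number can be obtained greedily. The sketch below is an illustration only: the sampled class of lines, the grid and the helper names are assumptions, and the greedy cover (with centers taken from the class itself) need not be minimal.

```python
def sup_dist(f, g):
    """L_infinity(l_2^d) distance between two curves sampled on a common grid."""
    return max(sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5 for a, b in zip(f, g))

def greedy_cover(F, eps):
    """Pick centers from F until every member lies within eps of some center."""
    centers = []
    for f in F:
        if all(sup_dist(f, c) > eps for c in centers):
            centers.append(f)
    return centers

# Sampled lines z -> (z, s * z) for slopes s in {0, 0.1, ..., 1}, with z on a grid of [0, 1].
grid = [i / 10.0 for i in range(11)]
F = [[(z, (s / 10.0) * z) for z in grid] for s in range(11)]
cover = greedy_cover(F, eps=0.25)
```

Two such lines differ by the difference of their slopes in this metric, so a radius of 0.25 leaves four centers (slopes 0, 0.3, 0.6 and 0.9).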

The next two results are similar in their style to the bounds obtained in |2|, how- 
ever slightly streamlined, as they are independent of some technical conditions 
on T as needed in |2|. 

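Proposition 2 ($L_\infty(\ell_2^d)$ bounds for Principal Manifolds).
Denote by $\mathcal{F}$ a class of continuous functions from $\mathcal{Z}$ into $\mathcal{X} \subseteq U_r$ and let $\mu$ be a
distribution over $\mathcal{X}$. If $m$ points are drawn i.i.d. from $\mu$, then for all $\eta > 0$, $\varepsilon \in
(0, \eta/2)$

$$\Pr\Big\{ \sup_{f \in \mathcal{F}} \big| R^m_{emp}[f] - R[f] \big| > \eta \Big\} \leq 2 \mathcal{N}\big( \tfrac{\varepsilon}{8r}, \mathcal{F}, L_\infty(\ell_2^d) \big) \, e^{-m(\eta - \varepsilon)^2 / (8 r^4)}.$$

Proof. By definition of $R^m_{emp}[f] = \frac{1}{m} \sum_{i=1}^m \min_z \|f(z) - x_i\|^2$ the empirical
quantization functional is an average of $m$ iid random variables which are each bounded
by $4r^2$ due to the fact that $x$ is contained in a ball of radius $r$. Hence we may
apply Hoeffding's inequality to obtain

$$\Pr\big\{ \big| R^m_{emp}[f] - R[f] \big| > \eta \big\} \leq 2 e^{-m \eta^2 / (8 r^4)}. \tag{19}$$

By the Lipschitz property of the $\ell_2$ norm (the 'target' values are bounded by
$r$), an $\frac{\varepsilon}{8r}$ cover of $\mathcal{F}$ is an $\frac{\varepsilon}{2}$ cover of the loss function induced class: For every
$f \in \mathcal{F}$ there exists some $f_i \in \mathcal{N}_{\varepsilon/8r}$ such that $\|f_i - f\| \leq \frac{\varepsilon}{8r}$. Hence also
$|R^m_{emp}[f] - R^m_{emp}[f_i]| \leq \frac{\varepsilon}{2}$ and $|R[f] - R[f_i]| \leq \frac{\varepsilon}{2}$. Consequently

$$\Pr\Big\{ \sup_{f \in \mathcal{F}} \big| R^m_{emp}[f] - R[f] \big| > \eta \Big\} \leq \Pr\Big\{ \max_{f_i \in \mathcal{N}_{\varepsilon/8r}} \big| R^m_{emp}[f_i] - R[f_i] \big| > \eta - \varepsilon \Big\} \tag{20}$$

Substituting (19) into (20) and taking the union bound over $\mathcal{N}_{\varepsilon/8r}$ gives the
desired result.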

This result is useful to assess the quality of an empirically found manifold. In
order to obtain rates of convergence we also need a result connecting the expected
quantization error of the principal manifold $f^*_{emp}$ minimizing $R_{emp}[f]$ and the
manifold $f^*$ with minimal quantization error $R[f^*]$.

Proposition 3 (Rates of Convergence for Optimal Estimates).
With the definitions of Proposition 2 and the definition of $f^*_{emp}$ and $f^*$ one has

$$\Pr\Big\{ \sup_{f \in \mathcal{F}} \big| R[f^*_{emp}] - R[f^*] \big| > \eta \Big\} \leq 2 \big( \mathcal{N}_{\varepsilon/4r} + 1 \big) \, e^{-m(\eta - \varepsilon)^2 / (32 r^4)}.$$

Proof. The proof is similar to the one of Proposition 2, however uses $\mathcal{N}_{\varepsilon/4r}$ and
$\eta/2$ to bound

$$R[f^*_{emp}] - R[f^*] = R[f^*_{emp}] - R_{emp}[f^*_{emp}] + R_{emp}[f^*_{emp}] - R[f^*] \tag{21}$$

$$\leq \varepsilon + R[f_i] - R_{emp}[f_i] + R_{emp}[f^*_{emp}] - R[f^*] \tag{22}$$

where $f_i \in \mathcal{N}_{\varepsilon/4r}$ and clearly $R_{emp}[f^*_{emp}] \leq R_{emp}[f^*]$. Now apply Hoeffding's
inequality, the union bound and change $\eta + \varepsilon$ into $\eta$ to prove the claim.

Having provided a number of uniform convergence bounds, it is now
necessary to bound $\mathcal{N}$ in a suitable way.



7 Covering and Entropy Numbers 

Before going into details let us briefly review what already exists in terms of
bounds on the covering number $\mathcal{N}$ for $L_\infty(\ell_2^d)$ metrics. Kegl et al. [7] essentially
show that

$$\log \mathcal{N}(\varepsilon, \mathcal{F}) = O\big(\tfrac{1}{\varepsilon}\big) \tag{24}$$

under the following assumptions: They consider polygonal curves $f(\cdot)$ of length
$L$ in a sphere $U_r$ of radius $r$ in $\mathcal{X}$. The distance measure (no metric!) for $\mathcal{N}(\varepsilon)$
is defined as $\sup_{x \in U_r} |\Delta(x, f) - \Delta(x, f')| \leq \varepsilon$. Here $\Delta(x, f)$ is the minimum
distance between a curve $f(\cdot)$ and $x \in U_r$.

By using functional analytic tools [13] one can obtain more general results, which
then, in turn, can replace (24) to obtain better bounds on the expected
quantization error by using the properties of the regularization operator.

Denote by $\mathfrak{L}(E, F)$ the set of all bounded linear operators $T$ between two normed
spaces $(E, \|\cdot\|_E)$, $(F, \|\cdot\|_F)$. The $n$th entropy number of a set $M \subset E$ relative
to a metric $\rho$, for $n \in \mathbb{N}$, is

$$\varepsilon_n(M) := \inf\{\varepsilon : \mathcal{N}(\varepsilon, M, \rho) \leq n\}.$$

The entropy numbers of an operator $T \in \mathfrak{L}(E, F)$ are defined as

$$\varepsilon_n(T) := \varepsilon_n(T(U_E)). \tag{25}$$

Note that $\varepsilon_1(T) = \|T\|$, and that $\varepsilon_n(T)$ certainly is well defined for all $n \in \mathbb{N}$ if
$T$ is a compact operator, i.e. if $T(U_E)$ is compact.

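What will be done in the following is to bound the entropy number of parametrized
curves in $L_\infty(\ell_2^d)$ satisfying the constraint $\|Pf(\cdot)\|^2 \leq \Lambda^2$ by viewing

$$\mathcal{F}_\Lambda := \{ f: \mathcal{Z} \to \ell_2^d,\ z \mapsto f(z) : f \text{ is continuous},\ \|Pf\| \leq \Lambda \}$$

as the image of the unit ball under an operator $T$. A key tool in bounding the
relevant entropy number is the following factorization result.

Lemma 1 (Carl and Stephani [?, p. 11]). Let $E, F, G$ be Banach spaces,
$R \in \mathfrak{L}(F, G)$, and $S \in \mathfrak{L}(E, F)$. Then, for $n, t \in \mathbb{N}$,

$$\varepsilon_{nt}(RS) \leq \varepsilon_n(R) \varepsilon_t(S), \quad \varepsilon_n(RS) \leq \varepsilon_n(R) \|S\|, \quad \varepsilon_n(RS) \leq \varepsilon_n(S) \|R\|. \tag{26}$$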

As one is dealing with vector valued functions $\mathcal{F}_\Lambda$, it is handy to view $f(\cdot)$ as
generated by a linear $d = \dim \mathcal{X}$ dimensional operator in feature space, i.e.
$f(z) = W\Phi(z) = (\langle w_1, \Phi(z) \rangle, \ldots, \langle w_d, \Phi(z) \rangle)$ with $\|W\|^2 := \sum_{i=1}^d \|w_i\|^2$. Here
the inner product $\langle \cdot, \cdot \rangle$ is given by the regularization operator $P$ as

$$\langle f, g \rangle := \langle Pf, Pg \rangle_{L_2} = \int_{\mathcal{Z}} (Pf)(x) (Pg)(x) \, dx \tag{27}$$

where the latter was described in section 4. In practice $w$ is expanded in terms
of kernel functions $k(x_i, \cdot)$. The latter can be shown to represent the map from
$\mathcal{Z}$ into the associated Reproducing Kernel Hilbert Space (RKHS) [?] (sometimes
called feature space). Hence $\Phi(x_i) = k(x_i, \cdot)$, where the dot product is given by
(27). These techniques may be used to give uniform convergence bounds, which
are stated in terms of the eigenvalues $\lambda_i$ of the RKHS.

Proposition 4 (Williamson, Smola, and Scholkopf [13]). Let $\Phi(\cdot)$ be the
map onto the eigensystem introduced by a Mercer kernel $k$ with eigenvalues $\lambda_i$,
$C_k$ a constant of the kernel, and $A$ be the diagonal map

$$A: \mathbb{R}^{\mathbb{N}} \to \mathbb{R}^{\mathbb{N}}, \quad A: (x_j)_j \mapsto A(x_j)_j = (a_j x_j)_j. \tag{28}$$

Then $A^{-1}$ maps $\Phi(\mathcal{X})$ into a ball of finite radius $R_A = C_k \|(\sqrt{\lambda_j} / a_j)_j\|_{\ell_2}$, centered
at the origin if and only if $(\sqrt{\lambda_j} / a_j)_j \in \ell_2$.

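The evaluation operator $S$ plays a crucial role to deal with entire classes of
functions (instead of just a single $f(\cdot)$). It is defined as

$$S_{\Phi(\mathcal{Z})}: (\ell_2)^d \to L_\infty(\ell_2^d) \quad \text{and} \quad S_{\Phi(\mathcal{Z})}: W \mapsto (\langle w_1, \Phi(\mathcal{Z}) \rangle, \ldots, \langle w_d, \Phi(\mathcal{Z}) \rangle). \tag{29}$$

By a technical argument one can see that it is possible to replace $(\ell_2)^d$ by $\ell_2$
without further worry; simply reindex the coefficients by

$$I_d: (\ell_2)^d \to \ell_2, \quad I_d: ((w_{11}, w_{12}, \ldots), (w_{21}, w_{22}, \ldots), \ldots, (w_{d1}, w_{d2}, \ldots)) \mapsto (w_{11}, w_{21}, \ldots, w_{d1}, w_{12}, w_{22}, \ldots, w_{d2}, w_{13}, \ldots) \tag{30}$$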
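On truncated coefficient sequences the reindexing (30) is a plain interleaving, which makes it easy to see that it is invertible and preserves the $\ell_2$ norm. The following Python sketch uses assumed helper names and a toy example with $d = 2$.

```python
def interleave(W):
    """I_d on truncated sequences: (w_11, w_21, ..., w_d1, w_12, ..., w_d2, ...)."""
    return [w for column in zip(*W) for w in column]

def deinterleave(v, d):
    """Inverse map: recover the d coefficient sequences."""
    return [v[i::d] for i in range(d)]

W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
v = interleave(W)
```

Here `v` is `[1.0, 4.0, 2.0, 5.0, 3.0, 6.0]`; the sum of squares is unchanged, and `deinterleave(v, 2)` returns the original sequences.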

By construction $I_d U_{(\ell_2)^d} = U_{\ell_2}$ and vice versa, thus $\|I_d\| = \|I_d^{-1}\| = 1$. Before
proceeding to the actual theorem one has to define a scaling operator $A_d$ for the
multi output case as the $d$ times tensor product of $A$, i.e.

$$A_d: (\ell_2)^d \to (\ell_2)^d \quad \text{and} \quad A_d := \underbrace{A \times A \times \cdots \times A}_{d\text{-times}} \tag{31}$$


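Theorem 1 (Bounds for Principal Curves Classes). Let $k$ be a Mercer
kernel, $\Phi$ be the corresponding map into feature space, and let $T := S_{\Phi(\mathcal{Z})} \Lambda$
where $S_{\Phi(\mathcal{Z})}$ is given by (29) and $\Lambda \in \mathbb{R}^+$. Let $A$ be defined by (28) and $A_d$ by
(31). Then the entropy numbers of $T$ satisfy the following inequality:

$$\varepsilon_n(T) \leq \Lambda \varepsilon_n(A_d) \tag{32}$$

Proof. As pointed out before one has to use a factorization argument. In
particular one uses the following property.

In other words one exploits

$$\varepsilon_n\big( S_{\Phi(\mathcal{Z})} (\Lambda U_{(\ell_2)^d}) \big) = \varepsilon_n\big( S_{(A^{-1}\Phi(\mathcal{Z}))} A_d \Lambda I_d^{-1} \big) \tag{34}$$

$$\leq \| S_{(A^{-1}\Phi(\mathcal{Z}))} \| \, \varepsilon_n(A_d) \, \Lambda \, \| I_d^{-1} \| \leq \Lambda \varepsilon_n(A_d). \tag{35}$$

Here we have relied on Proposition 4 which says $A^{-1}\Phi(\mathcal{Z}) \subseteq U$ and thus by
Cauchy-Schwarz, $\| S_{(A^{-1}\Phi(\mathcal{Z}))} \| \leq 1$.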



The price for dealing with vector valued functions is a degeneracy in the
eigenvalues of $A_d$: scaling factors appear $d$ times, instead of only once in the single
output situation. From a theorem for degenerate eigenvalues of scaling operators
[13] one immediately obtains the following corollary.

Corollary 1 (Entropy numbers for the vector valued case). Let $k$ be a
Mercer kernel, let $A$ be defined by (28) and $A_d$ by (31). Then

$$\varepsilon_n(A_d: \ell_2 \to \ell_2) \leq \inf_{(a_s)_s : (\sqrt{\lambda_s}/a_s)_s \in \ell_2} \ \sup_{j \in \mathbb{N}} \ 6 C_k \sqrt{d} \, \big\| (\sqrt{\lambda_s} / a_s)_s \big\|_{\ell_2} \, n^{-\frac{1}{jd}} (a_1 a_2 \cdots a_j)^{\frac{1}{j}}.$$


Note that the dimensionality of Z does not affect these considerations directly, 
however it has to be taken into account implicitly by the decay of the eigenvalues 
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Ol of the integral operator induced by k. d appears twice — once due to the 
increased operator norm (the y/d term) for the scaling operator Ad, and secondly 
due to the slower decay properties (each scaling factor ai appears d times). 

The same techniques that led to the bounds on entropy numbers in m can also 
be applied here. As this is rather technical, we only sketch a similar result for 
the case of principal manifolds, for $\dim Z = 1$ and exponential polynomial decay
of the eigenvalues $\lambda_i$ of the kernel $k$.



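Proposition 5 (Exponential Polynomial decay). Suppose $k$ is a Mercer
kernel with $\lambda_j = \beta^2 e^{-\alpha j^p}$ for some $\alpha, \beta, p > 0$. Then

$$\ln \epsilon_n^{-1}(A_d : \ell_2 \to \ell_2) = O\bigl(\ln^{\frac{p}{p+1}} n\bigr) \qquad (36)$$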



Proof. We use a series {aj)j = e ^ Then we bound 




+ Ra-rp/r 



1 — S -5^ : -P 

and (flia 2 . . -aj)^ = e s=i < for some positive number (j). For the 

purpose of finding an upper bound, sup^gj^ can be replaced by sup^gj]^ One 

computes sup^-gq n * e which is obtained for some j = ^ Inr+i n and 
some $\phi' > 0$. Resubstitution yields the claimed rate of convergence for any
$\tau \in (0, \alpha)$, which proves the theorem.



Possible kernels for which Proposition 5 applies are the Gaussian RBF, i.e. $k(x,x') =
\exp(-\|x - x'\|^2)$ ($p = 2$), and the “Damped Harmonic Oscillator”, i.e. $k(x,x') =$
with p =1. For more details on this issue see H3|. Finally one has to 
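invert (36) to obtain a bound on $\mathcal{N}(\epsilon, \mathcal{F}_\Lambda)$. We have:

$$\ln \mathcal{N}\bigl(\epsilon, \mathcal{F}_\Lambda, L_\infty(\ell_2^d)\bigr) = O\bigl(\ln^{\frac{p+1}{p}} (1/\epsilon)\bigr) \qquad (37)$$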

A similar result may be obtained for the case of polynomial decay in the eigen- 
values of the Mercer kernel. Following H3| one gets: 



Proposition 6 (Polynomial decay). Suppose $k$ is a Mercer kernel with $\lambda_j = \beta^2 j^{-\alpha-1}$
for some $\alpha, \beta > 0$. Then $\epsilon_n(A_d : \ell_2 \to \ell_2) = O\bigl(\ln^{-\frac{\alpha}{2}} n\bigr)$.



8 Rates of Convergence 

It is of theoretical interest how well Principal Manifolds can be learned. Kegl et
al. |Zj have shown a result for principal curves ($D = 1$) with a length
constraint regularizer. We show that if one utilizes a more powerful regularizer (as
one can do using our algorithm) one can obtain a bound of the form $O(m^{-\frac{\alpha}{2(\alpha+1)}})$
for polynomial rates of decay of the eigenvalues of $k$ ($\alpha+1$ is the rate of decay), or
$O(m^{-1/2+\alpha})$ for exponential rates of decay ($\alpha$ is an arbitrary positive constant).
The latter is nearly optimal, as supervised learning rates are no better than
$O(m^{-1/2})$.

® See |1 for how exact constants can be obtained instead of solely asymptotic rates.

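Proposition 7 (Learning Rates for Principal Manifolds).

For any fixed $\mathcal{F}_\Lambda$ the learning rate of principal manifolds can be lower bounded
by $O(m^{-1/2+\alpha})$ where $\alpha$ is an arbitrary positive constant, i.e.

$$R[f^*_{\mathrm{emp}}] - R[f^*] \le O(m^{-1/2+\alpha}) \quad \text{for } f^*_{\mathrm{emp}}, f^* \in \mathcal{F}_\Lambda \qquad (38)$$

if the eigenvalues of $k$ decay exponentially. Moreover the learning rate can be
bounded by $O(m^{-\frac{\alpha}{2(\alpha+1)}})$ in the case of polynomially decaying eigenvalues with
rate $\alpha + 1$. We obtain

$$R[f^*_{\mathrm{emp}}] - R[f^*] \le O(m^{-\frac{\alpha}{2(\alpha+1)}}) \quad \text{for } f^*_{\mathrm{emp}}, f^* \in \mathcal{F}_\Lambda \qquad (39)$$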



Proof. We use a clever trick from |E], however without the difficulty of also 
having to bound the approximation error. Proposition 0 will be useful. 



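$$R[f^*_{\mathrm{emp}}] - R[f^*] = \int_0^\infty \Pr\bigl\{R[f^*_{\mathrm{emp}}] - R[f^*] > \eta\bigr\}\, d\eta$$

$$\le u + \epsilon + 2\bigl(\mathcal{N}(\epsilon/4r) + 1\bigr) \int_{u+\epsilon}^{\infty} e^{-\frac{m(\eta - \epsilon)^2}{8 r^2}}\, d\eta$$

$$\le u + \epsilon + \frac{8 r^2}{u m}\bigl(\mathcal{N}(\epsilon/4r) + 1\bigr)\, e^{-\frac{m u^2}{8 r^2}}$$

$$\le \sqrt{\frac{8 r^2 \ln(\mathcal{N}(\epsilon/4r) + 1)}{m}} + \epsilon + \sqrt{\frac{8 r^2}{m \ln(\mathcal{N}(\epsilon/4r) + 1)}}. \qquad (40)$$

Here we used $\int_x^\infty \exp(-t^2/2)\,dt \le \exp(-x^2/2)/x$ in the second step. The third
inequality was derived by substituting $u = \sqrt{\frac{8 r^2}{m} \log(\mathcal{N}(\epsilon/4r) + 1)}$.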

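Setting $\epsilon = \sqrt{1/m}$ and exploiting (37) yields

$$R[f^*_{\mathrm{emp}}] - R[f^*] = O\Bigl(\sqrt{\ln^{\frac{p+1}{p}} m \,/\, m}\Bigr) + O(m^{-\frac{1}{2}}). \qquad (41)$$

As $\ln^{\frac{p+1}{p}} m$ can be bounded by any $c_\alpha m^{\alpha}$ for suitably large $c_\alpha$ and $\alpha > 0$ one obtains
the desired result. For polynomially decaying eigenvalues one obtains from
Proposition 6 that for a sufficiently large constant $c$, $\ln \mathcal{N}(\epsilon/4r, L_\infty(\ell_2^d)) \le c\,\epsilon^{-\frac{2}{\alpha}}$.
Substituting this into (40) yields

$$R[f^*_{\mathrm{emp}}] - R[f^*] \le \sqrt{\frac{8 r^2 c}{m}}\;\epsilon^{-\frac{1}{\alpha}} + 2\epsilon + O(m^{-\frac{1}{2}}). \qquad (42)$$

The minimum is obtained for $\epsilon = c' m^{-\frac{\alpha}{2(\alpha+1)}}$ for some $c' > 0$. Hence $R[f^*_{\mathrm{emp}}] - R[f^*]$
is of order $O(m^{-\frac{\alpha}{2(\alpha+1)}})$, which proves the theorem.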
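The balancing of the two $\epsilon$-dependent terms in (42) can be checked numerically. The following is a minimal sketch, not part of the original proof: it drops the $O(m^{-1/2})$ term, uses a hypothetical constant `A` in place of $\sqrt{8 r^2 c}$, and compares the log-log slope of the minimized bound with the predicted exponent $-\alpha/(2(\alpha+1))$. All function names are illustrative.

```python
import math

def bound(eps, m, alpha, A=1.0):
    """Epsilon-dependent part of (42); A is a stand-in for sqrt(8 r^2 c)."""
    return A * m ** -0.5 * eps ** (-1.0 / alpha) + 2.0 * eps

def best_eps(m, alpha, A=1.0):
    """Stationary point of bound() in eps, obtained by setting the derivative to zero."""
    return (A / (2.0 * alpha * math.sqrt(m))) ** (alpha / (alpha + 1.0))

def empirical_rate(alpha, m1=1e4, m2=1e8, A=1.0):
    """Log-log slope of the minimized bound between two sample sizes m1 < m2."""
    b1 = bound(best_eps(m1, alpha, A), m1, alpha, A)
    b2 = bound(best_eps(m2, alpha, A), m2, alpha, A)
    return (math.log(b2) - math.log(b1)) / (math.log(m2) - math.log(m1))

def predicted_rate(alpha):
    """Exponent claimed for polynomially decaying eigenvalues."""
    return -alpha / (2.0 * (alpha + 1.0))
```

For $\alpha = 1$ the predicted exponent is $-1/4$, and larger $\alpha$ (faster eigenvalue decay) moves the exponent towards $-1/2$.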
Interestingly the above result is slightly weaker than the result in jj] for the case
of length constraints, as the latter corresponds to the differentiation operator,
thus polynomial eigenvalue decay of order 2, i.e. $\alpha = 1$, and therefore to a rate
$\frac{\alpha}{2(\alpha+1)} = \frac{1}{4}$ (Kegl et al. 0 obtain $\frac{1}{3}$). It is unclear whether this is due to a
(possibly) suboptimal bound on the entropy numbers induced by $k$, or to the fact
that our results were stated in terms of the (stronger) $L_\infty(\ell_2^d)$ metric. This weakness,
which is yet to be fully understood, should not detract from the fact that we can get
better rates by using stronger regularizers, and our algorithm can utilize such
regularizers.



9 Summing Up 

We proposed a framework for unsupervised learning that can draw on the tech- 
niques available in minimization of risk functionals in supervised learning. This 
yielded an algorithm suitable to deal with principal manifolds. The expansion in 
terms of kernel functions and the treatment by regularization operators made it 
easier to decouple the algorithmic part (of finding a suitable manifold) from the 
part of specifying a class of manifolds with desirable properties. In particular, 
our algorithm does not crucially depend on the number of nodes used. 

Sample size dependent bounds for principal manifolds were given which depend
on the underlying distribution $\mu$ in a very mild way. These may be used to
perform capacity control more effectively. Moreover our calculations have shown
that regularized principal manifolds are a feasible way to perform unsupervised
learning. The proofs largely rest on a connection between functional analysis
and entropy numbers. This fact also allowed us to give good bounds on the
learning rate.
Abstract. Vapnik-Chervonenkis (VC) bounds play an important role
in statistical learning theory as they are the fundamental result which
explains the generalization ability of learning machines. Over the years
there has been substantial mathematical work on improving VC rates of
convergence of empirical means to their expectations. The
result obtained by Talagrand in 1994 seems to provide more or less the
final word on this issue as far as universal bounds are concerned. For
fixed distributions, however, this bound can be outperformed in practice. We
show indeed that it is possible to replace the $2\epsilon^2$ under the exponential
of the deviation term by the corresponding Cramer transform, as shown
by large deviations theorems. We then formulate rigorous
distribution-sensitive VC bounds and explain why these theoretical results on
such bounds can lead to practical estimates of the effective VC dimension
of learning structures.



1 Introduction and motivations 

One of the main parts of statistical learning theory in the framework developed 
by V.N. Vapnik m.m is concerned with non-asymptotic rates of convergence 
of empirical means to their expectations. 

The historical result obtained originally by Vapnik and Chervonenkis (VC) (see 
^iJJ) has provided the qualitative form of these rates of convergences and 
it is a remarkable fact that this result holds with no assumption on the prob- 
ability distribution underlying the data. Consequently, VC-theory of bounds is 
considered as a Worst-Case theory. 

This observation is the source of most of the criticisms addressed to VC-theory. 
It has been argued (see e.g. B), 0, 0, JEZ]) that VC bounds are loose in general. 
Indeed, there is an infinite number of situations in which the observed learning 
curves representing the generalization error of some learning structure are not 
well described by theoretical VC bounds. 



P. Fischer and H.U. Simon (Eds.): EuroCOLT’99, LNAI 1572, pp. 230l^^ 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 
In [17], D. Schuurmans criticizes the worst-case argument by pointing out that
there is no practical evidence that pathological probability measures must be
taken into account. This is the open problem we want to tackle: the
distribution-sensitivity of VC bounds.

Another question which motivates our work (Vapnik et al. [24]) is the measure
of effective VC dimension. The idea is to use a VC bound as an estimate of the
error probability tail, and to simulate this probability in order to identify the
constants and to estimate the VC dimension “experimentally”.

We will show how to improve these results by computing new accurate VC
bounds for fixed families of distributions.

It is thus possible to provide a deeper understanding of VC theory and its main
concepts. We also want to elaborate a practical method for measuring empirically
the VC dimension of a learning problem. This part is still work in progress (see
forthcoming |2E| for examples and effective simulations).

2 Classical VC bounds 



We first present universal VC bounds. For simplicity, we consider the particular
case of deterministic pattern recognition with noiseless data. The set-up is
standard:

Consider a device $T$ which transforms any input $X \in \mathbb{R}^d$ into some binary output
$Y \in \{0,1\}$. Let us denote by $P$ the distribution of the random variable $(X, Y)$, by $\mu$
the distribution of $X$, and by $R$ the Borel set in $\mathbb{R}^d$ of all $X$'s associated with the label
$Y = 1$.

The goal of learning is to select an appropriate model of the device $T$ among
a fixed set $\Gamma$ of models $C$ on the basis of a sample of empirical data $(X_1, Y_1)$,
..., $(X_n, Y_n)$. Here, $\Gamma$ is a family of Borel sets of $\mathbb{R}^d$ with finite VC dimension
$V$. The VC dimension is a complexity index which characterizes the capacity of
any given family of sets to shatter a set of points.

The error probability associated with the selection of $C$ in $\Gamma$ is:

$$L(C) = \mu(C \Delta R) \quad \text{(true error)}$$

$$L_n(C) = \frac{1}{n}\sum_{k=1}^{n} \mathbf{1}_{C \Delta R}(X_k) = \mu_n(C \Delta R) \quad \text{(empirical error)}$$

where $\mu_n$ is the empirical measure $\mu_n = \frac{1}{n}\sum_{k=1}^{n} \delta_{X_k}$.

The problem of model selection consists in minimizing the (unknown) risk functional
$L(C) = \mu(C \Delta R)$, a problem usually replaced by a tractable one, namely the
minimization of the empirical risk $L_n(C) = \mu_n(C \Delta R)$ (this principle is known as
ERM, for Empirical Risk Minimization). But then, one has to guarantee that the
minimum of the empirical risk is “close” to the theoretical minimum. This is
precisely the point where the Vapnik-Chervonenkis bound comes in. Their fundamental
contribution is the upper bound of the quantity

$$Q(n, \epsilon, \Gamma, \mu) = \Pr\Bigl\{\sup_{C \in \Gamma} |\mu_n(C) - \mu(C)| > \epsilon\Bigr\}.$$

^ $\Gamma$ satisfies some technical measurability condition, which is unimportant for our purpose.
In order to avoid such technicalities, we will assume that $\Gamma$ is countable.




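Remark 1. Note that

$$\Pr\Bigl\{\sup_{C \in \Gamma} |L_n(C) - L(C)| > \epsilon\Bigr\} = \Pr\Bigl\{\sup_{C \in \Gamma} |\mu_n(C \Delta R) - \mu(C \Delta R)| > \epsilon\Bigr\}$$

and by a slight notational abuse without any consequence on the final result,
we take $C := C \Delta R$ and $\Gamma := \Gamma \Delta R = \{C \Delta R : C \in \Gamma\}$.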

We recall here this result : 

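Theorem 1 (Vapnik-Chervonenkis). Let $\Gamma$ be a class of Borel sets of
$\mathbb{R}^d$ with finite VC dimension $V$. Then, for $n\epsilon^2 \ge 2$,

$$\sup_{\mu \in \mathcal{M}_1(\mathbb{R}^d)} \Pr\Bigl\{\sup_{C \in \Gamma} |\mu_n(C) - \mu(C)| > \epsilon\Bigr\} \le 4\Bigl(\frac{2en}{V}\Bigr)^{V} e^{-n\epsilon^2/8}.$$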

Remark 2. For a very readable proof, see 0 . 

This bound actually provides an estimate of the worst rate of convergence of the 
empirical estimator to the true probability. 

To comment on the form of the previous upper bound, we notice that the expo- 
nential term quantifies the worst deviation for a single set C and the polynomial 
term characterizes the richness of the family F. 

There have been several improvements for this type of bound since the pioneering
work of Vapnik and Chervonenkis (see Vapnik [23], Devroye, Pollard,
Alexander, Parrondo-Van den Broeck, Talagrand, Lugosi).
Many of these improvements resulted from theory and techniques in empirical
processes (see Pollard, Alexander, Talagrand), and these works indicated
that the proper variable is $\epsilon\sqrt{n}$ (or $n\epsilon^2$). Keeping this in mind, we can
summarize the qualitative behavior of VC-bounds by the following expression:

$$\underbrace{K(\epsilon, V)\cdot(n\epsilon^2)^{\tau(V)}}_{\text{capacity}}\cdot\underbrace{e^{-n\gamma\epsilon^2}}_{\text{deviation}} \quad \text{for } n\epsilon^2 \ge M,$$

with $M$ constant, $\tau$ an affine function of $V$, $\gamma \in [0,2]$, and $K(\epsilon,V)$ constant
independent of $n$, possibly depending on $\epsilon$ and $V$ (ideally $K(\epsilon, V) \le K(V)$).

Once we have stated this general form for VC-bounds, we can address the fol- 
lowing issues (both theoretically and practically) : 

Indeed, for a fixed set R, we have VCdim{F) = VCdim{F AR). For a proof, see e.g. 
(a) What is the best exponent $\gamma$ in the deviation term?

(b) What is the correct power $\tau(V)$ of $n$ in the capacity term?

(c) What is the order of the constant term $K(V)$ for the bound to be sharp?

In Table 1 we provide the theoretical answers brought by previous studies, in a
distribution-free framework.



Table 1. Universal bounds

                                   M      K(ε,V)           τ(V)       γ
Pollard (1984)                     2      …                V          1/32
Vapnik-Chervonenkis (1971)         2      …                V          1/8
Vapnik (1982)                      2      …                V          1/4
Parrondo-Van den Broeck (1993)     2      …                V          1
Devroye (1982)                     1      …                2V         2
Lugosi (1995)                      V/2    4e(V+1)(…)^V     2V         2
Alexander (1984)                   64     16               2048V      2
Talagrand (1994)                   0      K(V)             V - 1/2    2



To conclude this brief review, we point out that in the above distribution-free
results, the optimal value for the exponent $\gamma$ is 2 (which actually is the value
in Hoeffding’s inequality), and the best power achieved for the capacity term is
the one obtained by Talagrand, $V - \frac{1}{2}$ (see also the discussion about this point
in PO]). In most of the results, the function $K(\epsilon,V)$ is not bounded as $\epsilon$ goes
to zero, and only Alexander’s and Talagrand’s bounds satisfy the requirement
$K(\epsilon,V) \le K(V)$.

Our point in the remainder of this paper is that the $2\epsilon^2$ term under the
exponential can be larger in particular situations.



3 Rigorous distribution-dependent results 

Continuing from the results recalled in the previous section, one issue of
interest is the construction of bounds taking into account the characteristics
of the underlying probability measure $\mu$.

There are some works tackling this problem but with very different perspectives 
(see Vapnik Bartlett-Lugosi |3j, in a learning theory framework; Schuurmans 
H3 , in a PAC-learning framework; Pollard [HI , Alexander PJ , Massart P!> who 
provide the most significant results in empirical processes). 
We note that:

- in learning theory, the idea of distribution-dependent VC-bounds led to other
expressions for the capacity term, involving different concepts of entropy
such as VC-entropy, annealed entropy or metric entropy, depending on the
probability measure;

- in the theory of empirical processes, special attention was given to
refined exponential rates for restricted families of probability distributions
(see P, HP).

Our purpose is to formulate a distribution-dependent result preserving the
structure of universal VC bounds with an optimal exponential rate and with some
nearly optimal power $\tau(V)$, though we will keep the concept of VC dimension
unchanged.

Indeed, we would like to point out that if we consider a particular case where
the probability measure $\mu$ underlying the data belongs to a restricted set $\mathcal{P} \subset
\mathcal{M}_1(\mathbb{R}^d)$, then the deviation term can be considerably improved. Our argument is
borrowed from large deviations results which provide asymptotically exact estimates
of probability tails on a logarithmic scale. A close look at the proof of the main
theorem in the case of real random variables (Cramer’s theorem; for a review,
see 0 or m) reveals that the result holds as a non-asymptotic upper
bound. Thanks to this result, we obtain the exact term under the exponential
quantifying the worst deviation.

In order to formulate our result, we need to introduce the Cramer transform
(see the appendix) of a Bernoulli law with parameter $p$, given by:

$$\Lambda_p(x) = x \log\Bigl(\frac{x}{p}\Bigr) + (1 - x)\log\Bigl(\frac{1 - x}{1 - p}\Bigr), \quad \text{for } x \text{ in } [0, 1].$$
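As a small illustration of the Cramer transform of a Bernoulli law defined above, the following sketch evaluates it directly from the formula. The function name `cramer_bernoulli` and the restriction to $0 < p < 1$ are choices made for this example only; the convention $0 \log 0 = 0$ is the usual one for relative entropies.

```python
import math

def cramer_bernoulli(x, p):
    """Cramer transform of a Bernoulli(p) law at x in [0, 1], for 0 < p < 1."""
    if not (0.0 < p < 1.0):
        raise ValueError("p must lie strictly between 0 and 1")
    if not (0.0 <= x <= 1.0):
        raise ValueError("x must lie in [0, 1]")

    def term(a, b):
        # a * log(a / b), with the usual convention 0 * log 0 = 0
        return 0.0 if a == 0.0 else a * math.log(a / b)

    return term(x, p) + term(1.0 - x, 1.0 - p)
```

The transform vanishes at $x = p$ and grows as $x$ moves away from $p$, which is what makes it usable as a deviation exponent.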

Then, the uniform deviation of the empirical error from its expectation, for
a fixed family of probability distributions, can be estimated according to the
following theorem (a sketch of its proof is given in Sect. 6):

Theorem 2. Let $\Gamma$ be a family of measurable sets $C$ of $\mathbb{R}^d$ with finite VC
dimension $V$, and $\mathcal{P} \subset \mathcal{M}_1(\mathbb{R}^d)$ a fixed family of probability distributions $\mu$.

Let $\Lambda_p$ be the Cramer transform of a Bernoulli law with parameter $p$, let $J =
\{q : q = \mu(C),\ (\mu, C) \in \mathcal{P} \times \Gamma\}$ and set $p = \arg\min_{q \in J} |q - \frac{1}{2}|$. For every
$\beta > 0$, there exist $M(\beta, p, V)$ and $\epsilon_0(\beta, p, V) > 0$ such that if $\epsilon < \epsilon_0(\beta, p, V)$
and $n\epsilon^2 > M(\beta, p, V)$, we have:

$$\sup_{\mu \in \mathcal{P}} \Pr\Bigl\{\sup_{C \in \Gamma} |\mu_n(C) - \mu(C)| > \epsilon\Bigr\} \le K(V)\,(n\epsilon^2)^{V}\, e^{-n(1-\beta)\Lambda_p(\epsilon + p)}$$

Remark 3. The corrective term $\beta$ can be chosen to be as small as possible at the
cost of increasing $M(\beta, p, V)$.



® However, we could use alternatively effective VC dimension which is a distribution- 
dependent index (see m for details). 
Remark 4. Here we achieved $\tau(V) = V$ instead of the optimal $V - \frac{1}{2}$ found by
Talagrand. However, refining the proof by using a smart partitioning of
the family $\Gamma$ should lead to this value.

Remark 5. Note that the result above can be extended to the other fundamental
problems of statistics, such as regression or density estimation.



4 Comparison with Universal VC Bounds 

To appreciate the gain in considering distribution-dependent rates of convergence
instead of universal rates, we provide a brief discussion in which we compare the
$\Lambda_p(\epsilon + p)$ in our result with the universal $\gamma\epsilon^2$.

First, we point out that even in the worst-case situation (take $\mathcal{P} = \mathcal{M}_1(\mathbb{R}^d)$)
where $p = \frac{1}{2}$, we have a better result since $\Lambda = \Lambda_{1/2}(\epsilon + \frac{1}{2}) \ge 2\epsilon^2$ (see Fig. 1).

Fig. 1. Comparison between $\Lambda = \Lambda_{1/2}(\epsilon + \frac{1}{2})$ and $2\epsilon^2$ (horizontal axis: $\epsilon$).



In the general case when $p \ne \frac{1}{2}$, we claim that the distribution-dependent VC
bound obtained in Theorem 2 is of the same type as the universal bounds listed
in Sect. 2. In order to make the comparison, we recall a result proved by W.
Hoeffding:

Proposition 1 (Hoeffding). For any $p \in [0, 1]$, the following inequality
holds:

$$\Lambda_p(\epsilon + p) \ge g(p)\,\epsilon^2,$$

where the function $g$ is defined by:

$$g(p) = \begin{cases} \dfrac{1}{1 - 2p}\,\log\Bigl(\dfrac{1 - p}{p}\Bigr), & \text{if } p < \frac{1}{2}, \\[2ex] \dfrac{1}{2p(1 - p)}, & \text{if } p \ge \frac{1}{2}. \end{cases}$$



With the help of Fig. 2, the comparison between $g(p)$ and the values of $\gamma$ becomes
quite explicit. Indeed, it is clear that, as soon as $p \ne 1/2$, we have a better bound
than in the universal case.

Fig. 2. Comparison between distribution-dependent $g(p)$ and universal $\gamma$'s (horizontal axis: $p$).
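The comparison between $g(p)$ and the universal exponent can also be explored numerically. The sketch below is an illustration only: it assumes the standard form of Hoeffding's function $g$ (logarithmic branch for $p < 1/2$, $1/(2p(1-p))$ otherwise) and checks on a small grid that the Bernoulli Cramer transform dominates $g(p)\,\epsilon^2$. The helper names are hypothetical.

```python
import math

def cramer(x, p):
    """Cramer transform of a Bernoulli(p) law at x, for 0 < x < 1 and 0 < p < 1."""
    return x * math.log(x / p) + (1.0 - x) * math.log((1.0 - x) / (1.0 - p))

def hoeffding_g(p):
    """Hoeffding's exponent g(p) for 0 < p < 1; g(1/2) = 2 is its minimum."""
    if not (0.0 < p < 1.0):
        raise ValueError("p must lie strictly between 0 and 1")
    if p < 0.5:
        return math.log((1.0 - p) / p) / (1.0 - 2.0 * p)
    return 1.0 / (2.0 * p * (1.0 - p))

def worst_gap(ps, epsilons):
    """Smallest value of Lambda_p(eps + p) - g(p) * eps^2 over the grid."""
    return min(cramer(p + e, p) - hoeffding_g(p) * e * e
               for p in ps for e in epsilons if p + e < 1.0)
```

A non-negative `worst_gap` on the grid is consistent with the inequality of Proposition 1, and `hoeffding_g(p) > 2` for $p \ne 1/2$ reflects the gain over the universal exponent.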



5 PAC-Learning Application of the Result 

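A PAC-learning formulation of distribution-dependent VC bounds in terms of
sample complexity can easily be deduced from the main result:

Corollary 1. Under the same assumptions as in Theorem 2, the sample
complexity $N(\epsilon, \delta)$ that guarantees

$$\Pr\Bigl\{\sup_{C \in \Gamma} |\mu_n(C) - \mu(C)| > \epsilon\Bigr\} \le \delta$$

for $n \ge N(\epsilon, \delta)$ is bounded by:

$$N(\epsilon, \delta) \le \max\Bigl(\frac{2V}{\Lambda}\log\Bigl(\frac{2V\epsilon^2}{\Lambda}\Bigr),\ \frac{2}{\Lambda}\log\Bigl(\frac{K(V)}{\delta}\Bigr)\Bigr)$$

where $\Lambda = (1 - \beta)\cdot\Lambda_p(\epsilon + p)$.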
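Remark 6. In order to appreciate this result, one should consider that $\Lambda_p(\epsilon + p) \sim
g(p)\,\epsilon^2$.

Proof. Consider $n$ such that $(n\epsilon^2)^V \le e^{n\Lambda/2}$. Then, taking the log and
multiplying by $\epsilon^2$, we obtain $n\epsilon^2 \ge \frac{2V\epsilon^2}{\Lambda}\log(n\epsilon^2)$. Thus, taking the log again,
we have $\log(n\epsilon^2) \ge \log\bigl(\frac{2V\epsilon^2}{\Lambda}\bigr)$, which we inject in the last inequality. We get
$n \ge \frac{2V}{\Lambda}\log\bigl(\frac{2V\epsilon^2}{\Lambda}\bigr)$. If $n$ satisfies the previous condition, we have
$K(V)(n\epsilon^2)^V e^{-n\Lambda} \le K(V)\,e^{-n\Lambda/2}$, and we want $K(V)\,e^{-n\Lambda/2}$ to be smaller than $\delta$.
Hence, $n$ should also satisfy $n \ge \frac{2}{\Lambda}\log\frac{K(V)}{\delta}$.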
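A sample size can also be read off the bound of Theorem 2 numerically, without going through the closed-form expression. The sketch below is only an illustration under stated assumptions: the constant $K(V)$ is not specified in the text, so `K` is a hypothetical input; the conditions $\epsilon < \epsilon_0$ and $n\epsilon^2 > M$ of the theorem are not checked; and the search starts at $n = V/\Lambda$, beyond which the bound decreases in $n$.

```python
import math

def cramer_bernoulli(x, p):
    """Cramer transform of a Bernoulli(p) law at x, for 0 < x < 1 and 0 < p < 1."""
    return x * math.log(x / p) + (1.0 - x) * math.log((1.0 - x) / (1.0 - p))

def log_tail_bound(n, eps, p, V, K, beta):
    """Logarithm of K * (n eps^2)^V * exp(-n (1 - beta) Lambda_p(eps + p))."""
    lam = (1.0 - beta) * cramer_bernoulli(eps + p, p)
    return math.log(K) + V * math.log(n * eps * eps) - n * lam

def smallest_n(eps, delta, p, V, K, beta=0.1):
    """Smallest n past the peak of the bound for which the bound is at most delta."""
    lam = (1.0 - beta) * cramer_bernoulli(eps + p, p)
    target = math.log(delta)
    lo = max(1, math.ceil(V / lam))  # the bound decreases in n from here on
    if log_tail_bound(lo, eps, p, V, K, beta) <= target:
        return lo
    hi = 2 * lo
    while log_tail_bound(hi, eps, p, V, K, beta) > target:
        lo, hi = hi, 2 * hi
    while hi - lo > 1:  # invariant: bound(lo) > delta >= bound(hi)
        mid = (lo + hi) // 2
        if log_tail_bound(mid, eps, p, V, K, beta) <= target:
            hi = mid
        else:
            lo = mid
    return hi
```

Working with the logarithm of the bound avoids overflow in $(n\epsilon^2)^V$ for large $V$.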

As a matter of fact, Theorem 2 provides an appropriate theoretical foundation
for computer simulations. Indeed, in practical situations, a priori information
about the underlying distribution and about realistic elements $C$ of the family
$\Gamma$ turns distribution-dependent VC bounds into an operational tool for obtaining
estimates of the effective VC dimension $V$ and of the constant $K(V)$ as well (see
m for examples).



6 Elements of proof for Theorem 2 

In this section, we provide a sketch of the proof of Theorem 2 (for a complete and
general proof, see m). It relies on some results from empirical processes theory.
The line of proof is inspired by the direct approximation method exposed by
D. Pollard, while most of the techniques and intermediate results used in
this proof are due to M. Talagrand and come from m, m.

First, note that if the family $\Gamma$ is finite, the proof is a straightforward
consequence of Chernoff’s bound (see the appendix) together with the union-of-events
bound. In the case of a countable family, we introduce a finite approximation
$\Gamma_\lambda$ which is a $\lambda$-net for the symmetric difference associated with the measure $\mu$,
with cardinality N{P, p, A) = N{\). We shall take A = 

The first step of the proof is to turn the global supremum of the empirical
process $G_n(C) = \mu_n(C) - \mu(C)$ into a more tractable expression like the sum of
a maximum over a finite set and some local supremum. Then, the tail $Q(n, \epsilon, \Gamma, \mu)$
is bounded by $A + B$, where $A$ is the tail of the maximum of a set of random
variables which can be bounded by:



A<V(A) max Pr||G„(G*)|>(l-^)e| , (1) 

^ If $\Gamma$ is totally bounded, by definition, it is possible, for every $\lambda > 0$, to cover $\Gamma$ by
a finite number of balls of radius $\lambda$ centered in $\Gamma$. Consider a minimal cover of $\Gamma$;
then a $\lambda$-net will be the set of all the centers of the balls composing this cover.

and $B$ is the tail of the local supremum of a family of random variables bounded
as follows:



S < iV(A) max Pr J sup |G„(C) - G„(C*)| > ^ I , (2) 

c*Gr^ [cgB(c*,a) ^ J 

where $B(C^*, \lambda) = \{C \in \Gamma : \mu(C \Delta C^*) \le \lambda\}$.

The probability tail in (1) can be bounded by large deviations estimates
according to Chernoff’s bound:

Pr ||G„(G*)| > (1 - ^)e| < , 



where $p = \arg\min_{q :\, q = \mu(C),\ (\mu, C) \in \mathcal{P} \times \Gamma} |q - \frac{1}{2}|$.

The estimation of (2) requires the use of technical results on empirical
processes, mainly from m and m: symmetrization of the empirical processes with
Rademacher random variables, decomposition of the conditional probability tail
using the median, and application of the chaining technique. In the end, we introduce
the parameter $u$ to obtain the bound:



B < 4N{X) ( 2e -|_ 

= m{X){D + F + G) 






( 3 ) 



where toi = ki{V) ■ (1/ne^) • log(fc 2 ne^)), and m 2 = k^{V) ■ (1/ne) • log(fc 2 ne^)). 
The meaning of each of the terms in (3) is the following: $D$ measures the
deviation of the symmetric process from the median, $F$ controls its variance and
$G$ bounds the tail of the median, which can be controlled thanks to the chaining
technique.

To get the proper bound from (3), one has to consider the constraint on $u$:



kb{P,p, V) 

n log(ne^) 



k4{/3,p) 



which leads to the condition : ne^ > M{P,p,V). 

To get the desired form of the bound, we eventually apply a result due to D.
Haussler 0:

$$N(\lambda) \le e(V + 1)\Bigl(\frac{2e}{\lambda}\Bigr)^{V},$$

and set A = u G I, which ends the proof. 
7 Appendix - Chernoff’s bound on large deviations

We recall the setting for Chernoff’s bound (see P| for further results).

Consider $\nu$ a probability measure over $\mathbb{R}$. $\hat\nu : \mathbb{R} \to\, ]0, +\infty]$ is the Laplace
transform of $\nu$, defined by $\hat\nu(t) = \int_{\mathbb{R}} e^{tx}\,\nu(dx)$.

The Cramer transform $\Lambda : \mathbb{R} \to [0, +\infty]$ of the measure $\nu$ is defined, for $x \in \mathbb{R}$,
by

$$\Lambda(x) = \sup_{t \in \mathbb{R}} \bigl\{tx - \log \hat\nu(t)\bigr\}.$$

If we go through the optimization of the function of $t$ inside the sup (it is a simple
fact that this function is infinitely differentiable, cf. e.g. CHI), we can compute
exactly the optimal value of $t$. Let $t(x)$ be that value. Then, we write

$$\Lambda(x) = t(x)\,x - \log \hat\nu(t(x)).$$
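For a Bernoulli law the optimization inside the sup can be carried out explicitly, which gives a way to sanity-check the definition. The sketch below is an illustration under the assumption that $\nu$ is Bernoulli($p$), so that $\hat\nu(t) = 1 - p + p e^t$; it compares the value obtained from the optimal $t(x)$, a crude grid search over $t$, and the closed form $x\log(x/p) + (1-x)\log((1-x)/(1-p))$. All function names are hypothetical.

```python
import math

def log_laplace_bernoulli(t, p):
    """log of the Laplace transform of a Bernoulli(p) law: log(1 - p + p e^t)."""
    return math.log(1.0 - p + p * math.exp(t))

def optimal_t(x, p):
    """Maximizer of t*x - log_laplace_bernoulli(t, p), valid for 0 < x < 1."""
    return math.log(x * (1.0 - p) / (p * (1.0 - x)))

def cramer_via_optimal_t(x, p):
    """Lambda(x) = t(x) x - log nu_hat(t(x))."""
    t = optimal_t(x, p)
    return t * x - log_laplace_bernoulli(t, p)

def cramer_closed_form(x, p):
    """Closed form of the Bernoulli Cramer transform for 0 < x < 1."""
    return x * math.log(x / p) + (1.0 - x) * math.log((1.0 - x) / (1.0 - p))

def cramer_via_grid(x, p, t_max=20.0, steps=4001):
    """Lower approximation of the sup over t using an evenly spaced grid."""
    ts = (-t_max + 2.0 * t_max * i / (steps - 1) for i in range(steps))
    return max(t * x - log_laplace_bernoulli(t, p) for t in ts)
```

The grid value can only underestimate the supremum, so it should sit just below the other two.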



Proposition 2 (Chernoff’s bound). Let $U_1, \ldots, U_n$ be real i.i.d. random
variables. Denote their sum by $S_n = \sum_{i=1}^{n} U_i$. Then, for every $\epsilon > 0$, we have:

Pr {|S'„ - ES'„| > e} < 

where A is the Cramer Transform of the random variable U\ . 
Abstract. We show that there exist individual lower bounds
corresponding to the upper bounds on the rate of convergence of
nonparametric pattern recognition which are arbitrarily close to Yang’s minimax
lower bounds, if the a posteriori probability function is in the classes used
by Stone and others. The rates are equal to those of the corresponding
regression estimation problem. Thus for these classes classification is not
easier than regression estimation in the individual sense either.



1 Introduction 

Let $(X, Y), (X_1, Y_1), (X_2, Y_2), \ldots$ be independent identically distributed $\mathcal{R}^d \times
\{0, 1\}$-valued random variables. In pattern recognition (or classification) one
wishes to decide whether the value of $Y$ (the label) is 0 or 1 given the
($d$-dimensional) value of $X$ (the observation), i.e., one wants to find a decision
function $g$ defined on the range of $X$ taking values 0 or 1 so that $g(X)$ equals
$Y$ with high probability. Assume that the main aim of the analysis is to minimize
the probability of error:

$$\min_g L(g) = \min_g P\{g(X) \ne Y\}. \qquad (1)$$

Let $\eta(x) \stackrel{\mathrm{def}}{=} P\{Y = 1 \mid X = x\} = E\{Y \mid X = x\}$ be the a posteriori probability (or
regression) function, let

$$g^*(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2, \\ 0 & \text{else} \end{cases}$$

be the Bayes-decision, let

$$L^* = L(g^*) = P\{g^*(X) \ne Y\}$$

* The author’s work was supported by a grant from the Hungarian Academy of Sci- 
ences (MTA SZTAKI). 



P. Fischer and H.U. Simon (Eds.): EuroCOLT’99, LNAI 1572, pp. 241-|2^^ 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 
be the Bayes-error and denote the distribution of $X$ by $\mu$. Introduce $\|f\|_q =
\left\{\int |f|^q \, d\mu\right\}^{1/q}$. It is well-known that for each measurable function $g : \mathcal{R}^d \to
\{0, 1\}$ the relation

$$L(g) - L^* = 2\,\Bigl\|\bigl(\eta - \tfrac{1}{2}\bigr)\, I_{\{g \ne g^*\}}\Bigr\|_1 \qquad (2)$$

holds. Therefore the function $g^*$ achieves the minimum in (1) and the minimum
is $L^*$.

For the classification problem, the distribution of $(X,Y)$ (and so $\eta$ and $g^*$) is
unknown. Given only a sample $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ of the distribution
of $(X, Y)$, one wants to construct a decision rule $g_n(x) = g_n(x, D_n) : \mathcal{R}^d \times (\mathcal{R}^d \times
\{0, 1\})^n \to \{0, 1\}$ such that

$$L_n = L(g_n) = P\{g_n(X) \ne Y \mid D_n\}$$

is close to $L^*$. In this paper we study asymptotic properties of $EL_n - L^*$.

If we have an estimate $\eta_n$ of the regression function $\eta$ and we derive a plug-in
rule $g_n$ from $\eta_n$ quite naturally by

$$g_n(x) = \begin{cases} 1 & \text{if } \eta_n(x) > 1/2, \\ 0 & \text{else,} \end{cases}$$

then from (2) we easily get

$$L_n - L^* \le 2\,\|\eta_n - \eta\|_1 \le 2\,\|\eta_n - \eta\|_2$$

(see Devroye, Gyorfi and Lugosi jS|). This shows that if $\|\eta_n - \eta\|_1 \to 0$ then
$L_n \to L^*$ in the same sense, and the latter has at least the same rate, i.e.,
classification is not more complex than regression estimation.
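The chain of inequalities for the plug-in rule can be illustrated on a toy example. The sketch below is an assumption-laden illustration, not part of the paper: $X$ takes finitely many values with probabilities `mu`, `eta` is the true regression function, `eta_hat` plays the role of a fixed estimate $\eta_n$, and the error probabilities are computed exactly rather than estimated. All names are hypothetical.

```python
def plug_in(eta_hat):
    """Plug-in decisions: 1 where the estimated regression function exceeds 1/2."""
    return [1 if e > 0.5 else 0 for e in eta_hat]

def error_probability(g, eta, mu):
    """P{g(X) != Y}: deciding 0 errs with probability eta(x), deciding 1 with 1 - eta(x)."""
    return sum(m * (e if d == 0 else 1.0 - e) for d, e, m in zip(g, eta, mu))

def lq_norm(f, mu, q):
    """L_q(mu) norm of a function given by its values on the support points."""
    return sum(m * abs(v) ** q for v, m in zip(f, mu)) ** (1.0 / q)

def excess_error_and_bounds(eta, eta_hat, mu):
    """Return (L_n - L*, 2 ||eta_hat - eta||_1, 2 ||eta_hat - eta||_2)."""
    l_star = error_probability(plug_in(eta), eta, mu)
    l_n = error_probability(plug_in(eta_hat), eta, mu)
    diff = [a - b for a, b in zip(eta_hat, eta)]
    return l_n - l_star, 2.0 * lq_norm(diff, mu, 1), 2.0 * lq_norm(diff, mu, 2)
```

With `mu = [0.25] * 4`, `eta = [0.1, 0.4, 0.6, 0.9]` and `eta_hat = [0.2, 0.55, 0.45, 0.8]`, the estimate flips the decision on the two middle points, where $\eta$ is close to 1/2, so the excess error stays well below both norm bounds.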

It is well-known that there exist regression estimates, and so classification rules,
which are universally consistent, i.e., which satisfy

$$EL_n \to L^*$$

for all distributions of $(X,Y)$. This was first shown in Stone |S] for nearest
neighbor estimates (see also |3| for a list of references).

Classification is actually easier than regression estimation in the sense that if
E|bri — v\\i 0, then for the plug-in rule 



EL„ - L* 

v/E{||77„-77||2} 



(3) 



(see jS| Chapter 6), i.e. the relative expected error of $g_n$ decreases faster than
the expected $L_2(\mu)$ error of $\eta_n$. (Moreover, if $\|\eta_n - \eta\|_1 \to 0$ a.s., then for the
plug-in rule



Ln-L* 

\\Vn-riW2 



0 a.s. 
(see Antos Q), i.e. the relation also holds for strong consistency.) However, the
value of the ratio above cannot be universally bounded; the convergence can
be arbitrarily slow. It depends on the behavior of $\eta$ near 1/2 and the rate of
convergence of $\eta_n$.

2 Lower Bounds 

Unfortunately, there do not exist rules for which $EL_n - L^*$ tends to zero with a
guaranteed rate of convergence for all distributions of $(X, Y)$. Theorem 7.2 and
Problem 7.2 in jOI imply the following slow rate of convergence result: Let $\{a_n\}$
be a sequence of positive numbers converging to zero with $1/16 \ge a_1 \ge a_2 \ge \ldots$.
For every sequence of decision rules $\{g_n\}$, there exists a distribution of $(X, Y)$,
such that $X$ is uniformly distributed on $[0, 1]$, $\eta \in \{0, 1\}$ ($L^* = 0$) and

$$EL_n \ge a_n$$

for all $n$.

Therefore one has to restrict the class of distributions, which one considers, 
in order to obtain nontrivial rate of convergence results. Then it is natural to 
ask what the fastest achievable rate is for a given class of distributions. This 
is usually done by considering minimax rate of convergence results, where one 
derives lower bounds according to the following definition. 

Definition 1. A sequence of positive numbers $a_n$ is called lower rate of
convergence for a class $\mathcal{D}$ of distributions of $(X, Y)$, if

$$\liminf_{n\to\infty}\ \inf_{g_n}\ \sup_{(X,Y)\in\mathcal{D}} \frac{EL_n - L^*}{a_n} > 0.$$

Yang mg pointed out that while (3) holds for every fixed distribution for which
$\eta_n$ is consistent, the optimal rates of convergence for many usual classes are the
same in classification and regression estimation. He shows many examples and
some counterexamples to this phenomenon with rates of convergence in terms of
metric entropy. Classification seems to have the same complexity as regression
estimation for classes which contain functions with many values near 1/2. (See
also Mammen and Tsybakov 0.)

E.g. it was shown in mg that for the Lipschitz classes below, the optimal
lower rate of convergence is $n^{-\frac{p}{2p+d}}$, the same as for regression estimation (see
also Stone PI). For lower bound results on other types of distribution classes
(e.g. Vapnik-Chervonenkis classes) see |S| and the references there.

Definition 2. Let $\mathcal{F}^{(p,M)}$ be the set of functions $f : \mathcal{R}^d \to \mathcal{R}$ such that for
$p = k + \beta$, $k \in \mathcal{N}_0$, $0 < \beta \le 1$,

$$|D^{\alpha} f(x) - D^{\alpha} f(z)| \le M\,\|x - z\|^{\beta},$$

where $D^{\alpha}$ denotes the partial derivatives for $\alpha = (\alpha_1, \ldots, \alpha_d)$, $\alpha_i \in \mathcal{N}_0$,
$\sum_{i=1}^{d} \alpha_i = k$.

Definition 3. Let $\mathcal{D}^{(p,M)}$ be the class of distributions of $(X,Y)$ such that
(i) $X$ is uniformly distributed on $[0, 1]^d$,
(ii) $\eta \in \mathcal{F}^{(p,M)}$.

It is well-known that there exist regression estimates $\eta_n$ which satisfy

$$\limsup_{n\to\infty}\ \sup_{(X,Y)\in\mathcal{D}^{(p,M)}} \frac{E\{\|\eta_n - \eta\|_2^2\}}{n^{-\frac{2p}{2p+d}}} < \infty \qquad (4)$$

(see, e.g., Barron, Birge and Massart g]), thus for the plug-in rules $g_n$

$$\limsup_{n\to\infty}\ \sup_{(X,Y)\in\mathcal{D}^{(p,M)}} \frac{EL_n - L^*}{n^{-\frac{p}{2p+d}}} < \infty, \qquad (5)$$

i.e., $\sup_{(X,Y)\in\mathcal{D}^{(p,M)}} (EL_n - L^*) = O\bigl(n^{-\frac{p}{2p+d}}\bigr)$. (This remains true when
condition (i) is dropped from Definition 3. Note that the rate does not depend on $M$.)

Theorem 1. (Yang) The sequence

$$a_n = n^{-\frac{p}{2p+d}}$$

is a lower rate of convergence for the class $\mathcal{D}^{(p,M)}$.



In some sense, such lower bounds are not satisfactory. They do not tell us
anything about the way the probability of error decreases as the sample size is
increased for a given classification problem. These bounds, for each $n$, give
information about the maximal probability of error within the class, but not about
the behavior of the probability of error for a single fixed distribution as the
sample size $n$ increases. In other words, the “bad” distribution, causing the largest
probability of error for a decision rule, may be different for each $n$. For example,
the lower bound for the class $\mathcal{D}^{(p,M)}$ does not exclude the possibility that there
exists a sequence of rules $\{g_n\}$ such that for every distribution in $\mathcal{D}^{(p,M)}$ the
expected probability of error $EL_n - L^*$ decreases at an exponential rate in $n$.
In this paper, we are interested in “individual” minimax lower bounds that
describe the behavior of the probability of error for a fixed distribution of $(X,Y)$
as the sample size $n$ grows.

Definition 4. A sequence of positive numbers $a_n$ is called individual lower rate
of convergence for a class $\mathcal{D}$ of distributions of $(X, Y)$, if

$$\inf_{\{g_n\}}\ \sup_{(X,Y)\in\mathcal{D}}\ \limsup_{n\to\infty} \frac{EL_n - L^*}{a_n} > 0,$$

where the infimum is taken over all sequences $\{g_n\}$ of decision rules.
The concept of individual lower rate has been introduced in Birge 0 concerning
density estimation. For individual lower rate results concerning pattern
recognition see Antos and Lugosi Pj.

We will show that for every sequence {bn} tending to zero, bnn is an indi- 
vidual lower rate of convergence of the class Hence there exist individual 

lower rates of these classes, which are arbitrarily close to the optimal lower rates. 
These rates are the same as the individual lower rates for the expected /l 2 (p) 
error of regression estimation for these classes in Antos, Gyorfi and Kohler | 2 |. 
(Actually the results here imply some results there.) Both individual lower rates 
are optimal by (0) and (0, hence we extended Yang’s observation for the indi- 
vidual rates for these classes. 

Our results also imply that the ratio el„-l tend to zero arbitrary 

slowly (even for a fixed sequence {r]n})- 

Theorem 2. Let $\{b_n\}$ be an arbitrary positive sequence tending to zero. Then
the sequence

$$b_n a_n = b_n n^{-\frac{p}{2p+d}}$$

is an individual lower rate of convergence for the class $\mathcal{D}^{(p,M)}$.

Remark 1. Applied to the sequence $\{\sqrt{b_n}\}$, Theorem 2 implies that for all
$\{g_n\}$ there is $(X,Y) \in \mathcal{D}^{(p,M)}$ such that

$$\limsup_{n\to\infty} \frac{\mathbf{E}L_n - L^*}{b_n a_n} = \infty .$$

Remark 2. Certainly Theorems 1 and 2 hold if we enlarge the class by omitting
condition (i) from the definition of $\mathcal{D}^{(p,M)}$.

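Call a sequence $c_n$ an upper rate of convergence for a class $\mathcal{D}$, if there exist rules
$g_n$ which satisfy

$$\limsup_{n\to\infty} \sup_{(X,Y)\in\mathcal{D}} \frac{\mathbf{E}L_n - L^*}{c_n} < \infty,$$

i.e., $\sup_{(X,Y)\in\mathcal{D}} (\mathbf{E}L_n - L^*) = O(c_n)$, and call it an individual upper rate of
convergence for a class $\mathcal{D}$, if there exist rules $g_n$ which satisfy

$$\sup_{(X,Y)\in\mathcal{D}} \limsup_{n\to\infty} \frac{\mathbf{E}L_n - L^*}{c_n} < \infty .$$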

This implies only that for every distribution in $\mathcal{D}$, $\mathbf{E}L_n - L^* = O(c_n)$, possibly
with different constants. Then the upper bound quoted above implies that
$n^{-\frac{p}{2p+d}}$ is an upper rate of convergence, and thus also an individual upper rate
of convergence for $\mathcal{D}^{(p,M)}$. While Theorem 1 shows only that there is no upper rate of
convergence for $\mathcal{D}^{(p,M)}$ better than $n^{-\frac{p}{2p+d}}$, it follows from Theorem 2 that
$n^{-\frac{p}{2p+d}}$ is even the optimal individual upper rate for $\mathcal{D}^{(p,M)}$ in the sense
that there does not exist an individual upper rate $c_n$ of convergence for $\mathcal{D}^{(p,M)}$
which satisfies

$$\lim_{n\to\infty} \frac{c_n}{n^{-\frac{p}{2p+d}}} = 0 .$$
Moreover, the preceding results imply

$$\inf_{\{g_n\}} \sup_{(X,Y)\in\mathcal{D}^{(p,M)}} \limsup_{n\to\infty} \frac{\mathbf{E}L_n - L^*}{n^{-\frac{p}{2p+d}}} = 0, \tag{6}$$

which shows that Theorem 2 cannot be improved by dropping $b_n$. This shows
the strange nature of individual lower bounds: while every sequence tending
to zero faster than $n^{-\frac{p}{2p+d}}$ is an individual lower rate for $\mathcal{D}^{(p,M)}$,
the sequence $n^{-\frac{p}{2p+d}}$ itself is not.



3 Proofs 



The proofs of the theorems apply the following lemma: 



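Lemma 1. Let $u = (u_1, \ldots, u_l)$ be an $l$-dimensional real vector taking values in
$[-1/4, 1/4]^l$, let $C$ be a zero mean random variable taking values in $\{-1,+1\}$,
and let $Y_1, \ldots, Y_l$ be independent binary variables given $C$ with

$$\mathbf{P}\{Y_i = 1|C\} = \frac{1}{2} + C u_i, \quad i = 1, \ldots, l .$$

Then for the error probability of the Bayes decision for $C$ based on $Y = (Y_1, \ldots, Y_l)$,

$$L^* \ge \frac{1}{4} e^{-10\sqrt{\sum_i u_i^2 + 4\sum_{i\ne i'} u_i^2 u_{i'}^2}} .$$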



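Proof. The Bayes decision is 1 if $\mathbf{P}\{C = 1|Y\} > \frac{1}{2}$ and $-1$ otherwise, therefore

$$L^* = \mathbf{E}\{\min(\mathbf{P}\{C = 1|Y\}, \mathbf{P}\{C = -1|Y\})\} = \mathbf{E}\{\pi\} .$$

One can verify that

$$\mathbf{P}\{C = 1|Y\} = \frac{T}{T+1},$$

where

$$T = \prod_{i\le l} \left(\frac{\frac12 + u_i}{\frac12 - u_i}\right)^{2Y_i - 1} = e^{\sum_{i\le l} Z_i},$$

where $Z_i = (2Y_i - 1)\log((1 + 2u_i)/(1 - 2u_i))$. For arbitrary $0 < q < 1/2$, $\pi > q$
if and only if $|\log T| < \log\frac{1-q}{q}$, therefore

$$\mathbf{E}\{\pi\} \ge q\,\mathbf{P}\{\pi > q\} = q\,\mathbf{P}\left\{|\log T| < \log\frac{1-q}{q}\right\} .$$

By Markov's inequality

$$\mathbf{P}\left\{|\log T| < \log\frac{1-q}{q}\right\} \ge 1 - \frac{\mathbf{E}|\log T|}{\log\frac{1-q}{q}} .$$

Moreover because of $|\log T| = \left|\sum_{i\le l} Z_i\right|$ we get

$$\mathbf{E}|\log T| = \mathbf{E}\left|\sum_{i\le l} Z_i\right| \le \sqrt{\mathbf{E}\Big\{\Big(\sum_{i\le l} Z_i\Big)^2\Big\}} = \sqrt{\sum_i \mathbf{E}\{Z_i^2\} + \sum_{i\ne i'} \mathbf{E}\{Z_i Z_{i'}\}} .$$

Using the inequality for $-1/4 \le x \le 1/4$

$$\left|\log\frac{1+2x}{1-2x}\right| = |\log(1+2x) - \log(1-2x)| = \log(1+2|x|) - \log(1-2|x|) \le 2|x| + \log 4 \cdot 2|x| < 5|x|,$$

on the one hand

$$\mathbf{E}\{Z_i^2\} = \left(\log\frac{1+2u_i}{1-2u_i}\right)^2 \le 25 u_i^2,$$

and on the other hand

$$\mathbf{E}\{(2Y_i - 1)(2Y_{i'} - 1)\} = 4\mathbf{E}\{Y_iY_{i'}\} - 2\mathbf{E}\{Y_i\} - 2\mathbf{E}\{Y_{i'}\} + 1$$
$$= 4(\mathbf{E}\{Y_iY_{i'}|C = 1\}\mathbf{P}\{C = 1\} + \mathbf{E}\{Y_iY_{i'}|C = -1\}\mathbf{P}\{C = -1\}) - 1$$
$$= 4\left(\Big(\frac12 + u_i\Big)\Big(\frac12 + u_{i'}\Big)\frac12 + \Big(\frac12 - u_i\Big)\Big(\frac12 - u_{i'}\Big)\frac12\right) - 1 = 4u_iu_{i'},$$

so

$$\mathbf{E}\{Z_iZ_{i'}\} = \mathbf{E}\{(2Y_i - 1)(2Y_{i'} - 1)\}\log\frac{1+2u_i}{1-2u_i}\log\frac{1+2u_{i'}}{1-2u_{i'}}$$
$$= 4u_iu_{i'}\log\frac{1+2u_i}{1-2u_i}\log\frac{1+2u_{i'}}{1-2u_{i'}} \le 4|u_iu_{i'}|\left|\log\frac{1+2u_i}{1-2u_i}\right|\left|\log\frac{1+2u_{i'}}{1-2u_{i'}}\right| \le 100u_i^2u_{i'}^2 .$$

Hence

$$\mathbf{E}|\log T| \le \sqrt{25\sum_i u_i^2 + 100\sum_{i\ne i'} u_i^2u_{i'}^2} .$$

Thus

$$\mathbf{E}\{\pi\} \ge q\left(1 - \frac{\mathbf{E}|\log T|}{\log\frac{1-q}{q}}\right) \ge q\left(1 - \frac{5\sqrt{\sum_i u_i^2 + 4\sum_{i\ne i'} u_i^2u_{i'}^2}}{\log\frac{1-q}{q}}\right) .$$

By choosing $q = \left(1 + \exp\left(1 + 5\sqrt{\sum_i u_i^2 + 4\sum_{i\ne i'} u_i^2u_{i'}^2}\right)\right)^{-1}$,

$$\mathbf{E}\{\pi\} \ge \frac{1}{\left(1 + \exp\left(1 + 5\sqrt{\sum_i u_i^2 + 4\sum_{i\ne i'} u_i^2u_{i'}^2}\right)\right)\left(1 + 5\sqrt{\sum_i u_i^2 + 4\sum_{i\ne i'} u_i^2u_{i'}^2}\right)}$$
$$\ge \frac{1}{(e+1)\,e^{10\sqrt{\sum_i u_i^2 + 4\sum_{i\ne i'} u_i^2u_{i'}^2}}} \ge \frac14 e^{-10\sqrt{\sum_i u_i^2 + 4\sum_{i\ne i'} u_i^2u_{i'}^2}} . \qquad \Box$$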
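The bound of Lemma 1 can be checked numerically for small $l$. The sketch below computes the Bayes error exactly by enumerating all outcomes $y \in \{0,1\}^l$ and compares it with the lower bound $\frac14 \exp(-10\sqrt{\sum_i u_i^2 + 4\sum_{i\ne i'} u_i^2u_{i'}^2})$. The form of the bound is taken from the end of the proof as reconstructed from a damaged source, so its constants should be treated as an assumption; the function names are illustrative.

```python
import itertools
import math


def bayes_error(u):
    """Exact Bayes error for deciding C in {-1,+1} (prior 1/2 each) from Y_1..Y_l,
    where the Y_i are conditionally independent with P{Y_i = 1 | C} = 1/2 + C * u_i."""
    total = 0.0
    for y in itertools.product((0, 1), repeat=len(u)):
        joint = {}
        for c in (-1, 1):
            prob = 0.5  # prior P{C = c}
            for yi, ui in zip(y, u):
                p1 = 0.5 + c * ui
                prob *= p1 if yi == 1 else 1.0 - p1
            joint[c] = prob
        # The Bayes rule errs with the smaller of the two joint probabilities.
        total += min(joint[-1], joint[1])
    return total


def lemma_lower_bound(u):
    """(1/4) * exp(-10 * sqrt(sum u_i^2 + 4 * sum_{i != i'} u_i^2 u_i'^2))."""
    s1 = sum(ui ** 2 for ui in u)
    s2 = sum(u[i] ** 2 * u[j] ** 2
             for i in range(len(u)) for j in range(len(u)) if i != j)
    return 0.25 * math.exp(-10.0 * math.sqrt(s1 + 4.0 * s2))


if __name__ == "__main__":
    u = [0.1, 0.2, 0.05]
    print(bayes_error(u), lemma_lower_bound(u))
```

The enumeration costs $2^l$ evaluations, so it is only meant as a sanity check for short vectors $u$.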



Proof of Theorem 1. This proof differs from that of Yang; it can be easily
modified to yield the individual lower bound in Theorem 2. First we define a subclass
of distributions $(X,Y)$ contained in $\mathcal{D}^{(p,M)}$. We pack infinitely many disjoint
cubes into $[0, 1]^d$ in the following way: For a given probability distribution $\{p_j\}$,
let $\{B_j\}$ be a partition of $[0, 1]$ such that $B_j$ is an interval of length $p_j$. We pack
disjoint cubes of volume $p_j^d$ into the rectangle

$$B_j \times [0, 1]^{d-1} .$$

Denote these cubes by $A_{j,1}, \ldots, A_{j,S_j}$, where $S_j = \lfloor 1/p_j \rfloor^{d-1}$. Let $a_{j,k}$ be the
center of $A_{j,k}$. Choose a function $m : \mathcal{R}^d \to [0, 1/4]$ such that

(I) the support of m is a subset of 

(II) J m(x) dx > 0, 

(III) m e 

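The class of a posteriori probability functions is indexed by a vector

$$c = (c_{1,1}, c_{1,2}, \ldots, c_{1,S_1}, c_{2,1}, c_{2,2}, \ldots, c_{2,S_2}, \ldots)$$

of $+1$ or $-1$ components. Denote the set of all such vectors by $\mathcal{C}$. For $c \in \mathcal{C}$ define
the function

$$\eta^{(c)}(x) = \frac12 + \sum_{j=1}^{\infty}\sum_{k=1}^{S_j} c_{j,k}\, m_{j,k}(x),$$

where $m_{j,k}(x) = p_j^p\, m(p_j^{-1}(x - a_{j,k}))$ and $a_{j,k}$ is the center of the cube $A_{j,k}$.
Then it is easy to check (cf. Stone, p. 1045) that $\eta^{(c)}(x) \in [0, 1]$ for all
$x \in [0, 1]^d$ and because of (III)

$$\eta^{(c)} \in \mathcal{F}^{(p,M)} .$$

Hence, each distribution $(X,Y)$ with $Y \in \{0,1\}$ and $\mathbf{E}\{Y|X = x\} = \mathbf{P}\{Y =
1|X = x\} = \eta^{(c)}(x)$ for all $x \in [0,1]^d$ for some $c \in \mathcal{C}$ is contained in $\mathcal{D}^{(p,M)}$,
which implies

$$\liminf_{n\to\infty} \inf_{g_n} \sup_{(X,Y)\in\mathcal{D}^{(p,M)}} \frac{\mathbf{E}L_n - L^*}{a_n} \ge \liminf_{n\to\infty} \inf_{g_n} \sup_{(X,Y):\,\mathbf{E}\{Y|X=x\}=\eta^{(c)}(x),\, c\in\mathcal{C}} \frac{\mathbf{E}L_n - L^*}{a_n} . \tag{7}$$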
Let $g_n$ be an arbitrary rule. By definition, $\{I_{A_{j,k}}/2 : j, k\}$ is an orthogonal system
in C 2 {v) for the measure t^{A) = J k therefore the projection gn — \ 

A 

of gn - ^ is given by 



ffn(x) - 



1 

2 



Vc k^ 

j,k 



Ax) 



where 



^n,j,k 



[ (gn — 1/2) diy 
J(gn-ll2)lA^j2di. X, 

A{lA,j2Afdv J 



J (gni^x) - l/2)myfe(a;) dx 



f mj^k(x)dx 

Aj,k 



J gn(x)mj^k(x) dx 



f mj^k(x) dx 

-^j,k 



(Note that Cnj,k G [—1,1].) Let c € C he arbitrary. Note that g* = ^ + 
Then by 0) 



Ln -L* = 2 



J ^ 2 d{g„^g*} dg — 2 J ^ ^ ™j,fc(gn 9 ) dfj, 

i.fc 




i.fc 



Let Cn. j,k be 1 if Cnj^k > 0 and —1 otherwise. Because of — cy^j > 

we get 



Ln-L*> -\\m\ 



E j PA-d 

{Cn,j,k^Cj^k}Pj 

j,k 



This proves

$$\mathbf{E}L_n - L^* \ge \frac12\|m\|_1 R_n(c), \tag{8}$$

where

$$R_n(c) = \sum_{j:\, np_j^{2p+d}\le 1}\ \sum_{k=1}^{S_j} p_j^{p+d} \cdot \mathbf{P}\{\tilde c_{n,j,k} \ne c_{j,k}\} . \tag{9}$$
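Equations (7) and (8) imply

$$\liminf_{n\to\infty} \inf_{g_n} \sup_{(X,Y)\in\mathcal{D}^{(p,M)}} \frac{\mathbf{E}L_n - L^*}{a_n} \ge \frac12\|m\|_1 \liminf_{n\to\infty} \inf_{g_n} \sup_{c\in\mathcal{C}} \frac{R_n(c)}{a_n} . \tag{10}$$

To bound the last term, we fix the rules $g_n$ and choose $c \in \mathcal{C}$ randomly. Let
$C = (C_{1,1}, C_{1,2}, \ldots)$
be a sequence of independent identically distributed random variables
independent of $X_1, X_2, \ldots$, which satisfy $\mathbf{P}\{C_{1,1} = 1\} = \mathbf{P}\{C_{1,1} = -1\} = 1/2$. Next
we derive a lower bound for

$$\mathbf{E}R_n(C) = \sum_{j:\, np_j^{2p+d}\le 1}\ \sum_{k=1}^{S_j} p_j^{p+d} \cdot \mathbf{P}\{\tilde c_{n,j,k} \ne C_{j,k}\} .$$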

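$\tilde c_{n,j,k}$ can be interpreted as a decision on $C_{j,k}$ using $D_n$. Its error probability is
minimal for the Bayes decision $\bar C_{n,j,k}$, which is 1 if $\mathbf{P}\{C_{j,k} = 1|D_n\} \ge \frac12$ and $-1$
otherwise, therefore

$$\mathbf{P}\{\tilde c_{n,j,k} \ne C_{j,k}\} \ge \mathbf{P}\{\bar C_{n,j,k} \ne C_{j,k}\} .$$

Let $X_{i_1}, \ldots, X_{i_l}$ be those $X_i \in A_{j,k}$. Then given $X_1, \ldots, X_n$, $(Y_{i_1}, \ldots, Y_{i_l})$ is
distributed as $(Y_1, \ldots, Y_l)$ in the conditions of Lemma 1 with $u_r = m_{j,k}(X_{i_r})$,
while

$$(Y_1, \ldots, Y_n) \setminus (Y_{i_1}, \ldots, Y_{i_l})$$

depends only on $C \setminus \{C_{j,k}\}$ and on $X_r$'s with $r \notin \{i_1, \ldots, i_l\}$, therefore is
independent of $C_{j,k}$ given $X_1, \ldots, X_n$. Now conditioning on $X_1, \ldots, X_n$, the error of
the conditional Bayes decision for $C_{j,k}$ based on $(Y_1, \ldots, Y_n)$ depends only on
$(Y_{i_1}, \ldots, Y_{i_l})$, hence Lemma 1 implies

$$\mathbf{P}\{\bar C_{n,j,k} \ne C_{j,k}|X_1, \ldots, X_n\} \ge \frac14 e^{-10\sqrt{\sum_i m_{j,k}^2(X_i) + 4\sum_{i\ne i'} m_{j,k}^2(X_i)\,m_{j,k}^2(X_{i'})}} .$$

By Jensen's inequality

$$\mathbf{P}\{\bar C_{n,j,k} \ne C_{j,k}\} = \mathbf{E}\{\mathbf{P}\{\bar C_{n,j,k} \ne C_{j,k}|X_1, \ldots, X_n\}\}$$
$$\ge \frac14 e^{-10\sqrt{n\mathbf{E}\{m_{j,k}^2(X_1)\} + 4n(n-1)\mathbf{E}\{m_{j,k}^2(X_1)\}^2}}$$
$$= \frac14 e^{-10\sqrt{np_j^{2p+d}\|m\|_2^2 + 4n(n-1)p_j^{4p+2d}\|m\|_2^4}}$$

independently of $k$. Thus if $np_j^{2p+d} \le 1$,

$$\mathbf{P}\{\bar C_{n,j,k} \ne C_{j,k}\} \ge \frac14 e^{-10\|m\|_2\sqrt{1+4\|m\|_2^2}} > 0,$$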

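and

$$\mathbf{E}R_n(C) \ge \sum_{j:\, np_j^{2p+d}\le 1} S_j\, p_j^{p+d} \cdot \mathbf{P}\{\bar C_{n,j,1} \ne C_{j,1}\} \ge K_1 \sum_{j:\, np_j^{2p+d}\le 1} p_j^{p+1}, \tag{11}$$

where $K_1 = \frac14 e^{-10\|m\|_2\sqrt{1+4\|m\|_2^2}}(1/2)^{d-1}$. Setting $p_j = (1/n)^{\frac{1}{2p+d}}$ for
$j \le \lfloor n^{\frac{1}{2p+d}} \rfloor$,

$$\mathbf{E}R_n(C) \ge K_1 \lfloor n^{\frac{1}{2p+d}} \rfloor\, n^{-\frac{p+1}{2p+d}} = K_1 a_n(1 - o(1)),$$

so

$$\liminf_{n\to\infty} \inf_{g_n} \sup_{c\in\mathcal{C}} \frac{R_n(c)}{a_n} \ge \liminf_{n\to\infty} \inf_{g_n} \frac{\mathbf{E}R_n(C)}{a_n} \ge K_1 > 0 . \tag{12}$$

This together with (10) implies the assertion. □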

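Proof of Theorem 2. We use the notations and results of the proof of Theorem 1.
Now we have by (8)

$$\inf_{\{g_n\}} \sup_{(X,Y)\in\mathcal{D}^{(p,M)}} \limsup_{n\to\infty} \frac{\mathbf{E}L_n - L^*}{b_na_n} \ge \frac12\|m\|_1 \inf_{\{g_n\}} \sup_{c\in\mathcal{C}} \limsup_{n\to\infty} \frac{R_n(c)}{b_na_n} . \tag{13}$$

In this case we have to choose $\{p_j\}$ independently from $n$. Since $b_n$ and $a_n$
tend to zero we can take a subsequence $\{n_t\}_{t\in\mathcal{N}}$ of $\{n\}_{n\in\mathcal{N}}$ with $b_{n_t} \le 2^{-t}$ and
$a_{n_t}^{1/p} \le 2^{-t}$. Define $q_t$ such that

$$\frac{2^{-t}}{q_t} = \left\lceil \frac{2^{-t}}{a_{n_t}^{1/p}} \right\rceil,$$

and choose $\{p_j\}$ as

$$q_1, \ldots, q_1, q_2, \ldots, q_2, \ldots, q_t, \ldots, q_t, \ldots,$$

where $q_t$ is repeated $2^{-t}/q_t$ times. So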

2"‘ 



E pT'= E 



2p-\-d ^ T 
<1 






qt 



■qr^ > E 



•uA 



t-.nq^P+'^<l 



E 









V 



.,1/p 



> E 



E7I j_ 1 

i/p ' ^ 



= E < 

t:n± >n 



i/p 



^ E 



^nt ^n± 



i -h ^ ant / t:nt>n 
by $a_{n_t}^{1/p} \le 2^{-t}$, and specially for $n = n_s$ (11) implies






„p+i 






K, 



2p -|- (i -1 

j:risp/^ <1 



2P 



t> s 



2P 



(14) 



Using onj one gets 



inf sup lim 



■Rn{c) 



> inf sup lim 



■RnAc) 



{Sr.} cGC bnO-n {gr^} cGC s^oohn^a. 



> ^r— inf sup lim 



RnXc) ^ ^1 ■ r T? I r 

> inf E lim 



RnAC) 

2^^ {s„} cGC “ 2P {g„} ^S^oo Ei?„^ (C) 

Because of (CU and the fact that for all c G C 



Rn{c)< Y. E 



„P+1 



j'.np - <1 



2p + d^T 

j:npY <1 



the sequence $R_{n_s}(C)/\mathbf{E}R_{n_s}(C)$ is uniformly bounded, so we can apply Fatou's
lemma to get



■ r ^ ■ r T 

mi sup iim > — mi iim hi 

{9n} cGC Onan 2^ { 5 ^} s^oo 



( RnSC) \ 
\ERnSC)J 




This together with (13) implies the assertion.



□ 
On Error Estimation for the Partitioning
Classification Rule
1521 Stoczek u. 2, Budapest, Hungary.
Abstract. The resubstitution and the deleted error estimates for the
partitioning classification rule from a sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ are
studied. The random part of the resubstitution estimate is shown to be
small for arbitrary partition and for any distribution of $(X,Y)$. If we
assume that $X$ has a density $f$ and the partitions consist of rectangles,
then the difference between the expected value of the estimate and the
Bayes error restricted to the partition is less than a constant times $1/\sqrt{n}$.
The main result of the paper is that, under the same conditions, the
deleted estimate is asymptotically normal.

1 Introduction

Let $X$ be the $d$-dimensional feature vector with distribution $\mu$, and let $Y$ be the
binary valued label. Throughout the paper it is assumed that $X$ takes its value
in a bounded region $\mathcal{X} \subset \mathcal{R}^d$. Denote the a posteriori probabilities by

$$P_i(x) = \mathbf{P}\{Y = i|X = x\}, \quad i = 0, 1 .$$

In pattern recognition the value of the label $Y$ is to be predicted upon observing
the feature vector $X$. The prediction rule or classifier $g$ is a function $\mathcal{X} \to \{0,1\}$,
whose performance is measured by the probability of error

$$L(g) = \mathbf{P}\{g(X) \ne Y\} .$$

The Bayes classifier

$$g^*(x) = \begin{cases} 0, & \text{if } P_1(x) < 1/2 \\ 1, & \text{otherwise} \end{cases}$$

is well-known to have minimal probability of error among all possible classifiers.
Its error probability $L(g^*)$ is called the Bayes error, and is denoted by $L^*$:

$$L^* = \mathbf{E}\{\min(P_0(X), P_1(X))\} .$$

Assume that $n$ independent copies of $(X, Y)$ form the available data sequence:

$$D_n = ((X_1, Y_1), \ldots, (X_n, Y_n)) .$$

These data may be used to design the classification rule $g_n(x)$, whose probability
of error is the random variable

$$L_n = L(g_n) = \mathbf{P}\{g_n(X) \ne Y|D_n\} .$$

Many important classification rules partition $\mathcal{X}$ into disjoint cells and classify in
each cell according to the majority vote among the labels of the $X_i$'s falling in
the same cell. Let $\mathcal{P}_n = \{A_{n,j}, j = 1, 2, \ldots\}$ be a partition of $\mathcal{X}$, let $m_n$ denote
the number of cells in the partition, and let $A_n(x)$ denote the cell in the partition
that includes $x$, then the partitioning classification rule becomes:

$$g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^n I_{\{Y_i=1\}}I_{\{X_i\in A_n(x)\}} \le \sum_{i=1}^n I_{\{Y_i=0\}}I_{\{X_i\in A_n(x)\}} \\ 1 & \text{otherwise.} \end{cases}$$
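The majority-vote rule described above can be sketched in a few lines. The sketch assumes a cubic partition of side `h` (so the cell of `x` is found by flooring each coordinate divided by `h`) and labels in {0, 1}; the function names, the dictionary representation and the choice of a cubic partition are illustrative assumptions. Ties and empty cells are resolved in favour of label 0, as in the rule above.

```python
import math
from collections import defaultdict


def cell_index(x, h):
    """Index of the cubic cell of side h containing the point x (a sequence of floats)."""
    return tuple(math.floor(coord / h) for coord in x)


def fit_partitioning_rule(X, y, h):
    """Count the labels 0 and 1 falling into each cell of the partition."""
    counts = defaultdict(lambda: [0, 0])
    for xi, yi in zip(X, y):
        counts[cell_index(xi, h)][yi] += 1
    return dict(counts)


def classify(counts, x, h):
    """Majority vote in the cell of x; 0 on ties and in empty cells."""
    n0, n1 = counts.get(cell_index(x, h), (0, 0))
    return 0 if n1 <= n0 else 1


if __name__ == "__main__":
    X = [[0.1], [0.2], [0.6], [0.7], [0.8]]
    y = [1, 1, 0, 0, 1]
    rule = fit_partitioning_rule(X, y, h=0.5)
    print([classify(rule, [t], 0.5) for t in (0.3, 0.9, 1.7)])
```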

Estimating the error probability of a classification rule $g_n$ is of great importance.
The designer always wants to know what performance can be expected from a
classifier. Since the distribution of the data is unknown, it is important to find
and analyze error estimation methods that work well independently of the
distribution of $(X,Y)$.



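2 Resubstitution estimate

The resubstitution estimate $L_n^{(R)}$ counts the number of errors committed on the
training sequence $(X_1,Y_1),(X_2,Y_2), \ldots, (X_n,Y_n)$ by the classification rule, i.e.
for a classifier $g_n$ it is defined as

$$L_n^{(R)} = \frac1n \sum_{i=1}^n I_{\{g_n(X_i)\ne Y_i\}},$$

which for the partitioning classification rule can be written in the following form:

$$L_n^{(R)} = \sum_{j=1}^{m_n} \min\{\nu_n(A_{n,j}),\, \mu_n(A_{n,j}) - \nu_n(A_{n,j})\},$$

where

$$\nu_n(A) = \frac1n\sum_{i=1}^n I_{\{X_i\in A\}}Y_i \quad\text{and}\quad \mu_n(A) = \frac1n\sum_{i=1}^n I_{\{X_i\in A\}} .$$

$L_n^{(R)}$ is an estimate of the Bayes error $R_n^*$ restricted to the partition $\mathcal{P}_n$:

$$R_n^* = \sum_{j=1}^{m_n} \min\{\nu(A_{n,j}),\, \mu(A_{n,j}) - \nu(A_{n,j})\},$$

where

$$\nu(A) = \mathbf{E}\nu_n(A) .$$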
Concerning the resubstitution error estimate for partitioning rule, the following
inequalities are known (see Sec. 23.2 in Devroye, Györfi and Lugosi (1996)): for
an arbitrary partition

$$\mathrm{Var}(L_n^{(R)}) \le \frac1n$$

and

$$\mathbf{E}L_n^{(R)} \le R_n^* .$$

For a finite partition of size 

- EL„ < (1) 

V n 

The resubstitution estimate for the partitioning rule is asymptotically normal 
under certain conditions: 



Lemma 1. (Gyorfi, Horvath (1998)) Consider the partitions where Anj are d- 
dimensional rectangles. Let a\^{x), i = 1,2 , ..., d, denote the sidelengths of An{x). 

Assume that for all x there exists a K{x) so that < K{x) for all 1 < i, j < d 
and n. Assume that 

lim swp diam(Anj) = 0 

n—*oo A 

■^rtg 



and 



and 



lim ^ 

n — >^oo Ji 



= 0 



,. logn 
lim ' , . , , , 
n^oo nX[An[x)) 



(2) 



where $\lambda$ is the Lebesgue measure. If $\mu$ has a density $f$ and there is a constant $c$
such that

$$|\mu(A_{n,j}) - 2\nu(A_{n,j})| \ge c\,\mu(A_{n,j}) \tag{3}$$

for all $n$ and $j$, then



® fV(0, 1). 



Remark 1. For cubic partitions with size $h_n$ these conditions mean $h_n \to 0$,
$nh_n^d \to \infty$ and $nh_n^d/\log n \to \infty$ as $n \to \infty$.

The random part of the estimate $L_n^{(R)}$ is small for arbitrary partition and for any
distribution of $(X,Y)$:

Theorem 1. For the resubstitution estimate for the partitioning rule, and for
all $n$ and $\epsilon > 0$,

$$\mathbf{P}\{n^{1/2}|L_n^{(R)} - \mathbf{E}L_n^{(R)}| > \epsilon\} \le 2e^{-2\epsilon^2} .$$
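The tail bound of Theorem 1, $\mathbf{P}\{\sqrt{n}\,|L_n - \mathbf{E}L_n| > \epsilon\} \le 2e^{-2\epsilon^2}$ for the resubstitution estimate, can be turned into a confidence radius. The sketch below evaluates the bound and inverts it; the function names and the confidence level used in the example are illustrative assumptions.

```python
import math


def resub_tail_bound(eps: float) -> float:
    """Upper bound 2 * exp(-2 * eps^2) on P{ sqrt(n) |L_n - E L_n| > eps }, capped at 1."""
    return min(1.0, 2.0 * math.exp(-2.0 * eps ** 2))


def deviation_at_confidence(n: int, delta: float) -> float:
    """Radius t such that the bound gives P{ |L_n - E L_n| > t } <= delta.

    Solves 2 * exp(-2 * n * t^2) = delta for t.
    """
    if n < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need n >= 1 and 0 < delta < 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


if __name__ == "__main__":
    for n in (100, 10_000):
        print(n, deviation_at_confidence(n, 0.05))
```

The radius shrinks like $1/\sqrt{n}$ and does not depend on the partition or on the distribution of $(X,Y)$, which is the point of the theorem.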
Using this result we can remove $m_n$ from the upper bound in (1).

Theorem 2. For any distribution of (X,Y) and for n large enough for the 
estimate Ln of the error probability of a partitioning rule satisfying the conditions 
of Lemma 0 

Vti 



3 Deleted estimate 



The deleted estimate or cross-validation attempts to avoid the bias present in
the resubstitution estimate. The method deletes $(X_i,Y_i)$ from the training data
and creates a classifier $g_{n-1}$ using the remaining $n - 1$ pairs. It tests for an error
on $(X_i,Y_i)$, and repeats this procedure for all $n$ pairs of the training data.
Formally denote the training set with $(X_i, Y_i)$ deleted by

$$D_{n,i} = ((X_1, Y_1), \ldots, (X_{i-1}, Y_{i-1}), (X_{i+1}, Y_{i+1}), \ldots, (X_n, Y_n)) .$$

Then define

$$L_n^{(D)} = \frac1n \sum_{i=1}^n I_{\{g_{n-1}(X_i, D_{n,i}) \ne Y_i\}} .$$

Clearly, the deleted estimate is almost unbiased in the sense that

$$\mathbf{E}L_n^{(D)} = \mathbf{E}L_{n-1} .$$

Concerning the deleted error estimate for partitioning rule the following is known
(see Sec. 24.5 in Devroye, Györfi and Lugosi (1996)): for an arbitrary partition

$$\mathbf{E}\{(L_n^{(D)} - L_n)^2\} \le \frac{1 + 6/e}{n} + \frac{6}{\sqrt{\pi(n-1)}} .$$

The resubstitution estimate for any partitioning classification rule is not larger than
the deleted estimate:

$$L_n^{(R)} \le L_n^{(D)}, \tag{4}$$

since if $(X_i,Y_i)$ is a mistake w.r.t. $L_n^{(R)}$, i.e. $Y_i \ne g_n(X_i)$, then the label $Y_i$ is in
the minority among the labels of the data falling into its cell. Then, of course,
$Y_i$ is in the minority of its cell w.r.t. $D_{n,i} = D_n \setminus \{(X_i,Y_i)\}$, which implies that
$(X_i,Y_i)$ is a mistake w.r.t. $L_n^{(D)}$.

The main aim of the paper is to derive the asymptotic normality of the
distribution of $L_n^{(D)}$.
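The relation between the two estimates can be checked on data. The sketch below computes the resubstitution estimate and the deleted (leave-one-out) estimate of a partitioning rule when each training point is given directly by the index of its cell; ties are resolved in favour of label 0. The function name and the input representation (a list of cell indices and a list of 0/1 labels) are illustrative assumptions.

```python
from collections import Counter


def resubstitution_and_deleted(cells, labels):
    """Return (resubstitution estimate, deleted estimate) of a partitioning rule.

    cells[i] is the (hashable) index of the cell containing X_i,
    labels[i] in {0, 1} is Y_i. Majority vote per cell, 0 on ties.
    """
    n = len(labels)
    ones, total = Counter(), Counter()
    for c, y in zip(cells, labels):
        total[c] += 1
        ones[c] += y
    resub_errors = deleted_errors = 0
    for c, y in zip(cells, labels):
        n1, n0 = ones[c], total[c] - ones[c]
        resub_errors += int((0 if n1 <= n0 else 1) != y)
        # Remove (X_i, Y_i) from its own cell before taking the vote.
        m1, m0 = n1 - y, n0 - (1 - y)
        deleted_errors += int((0 if m1 <= m0 else 1) != y)
    return resub_errors / n, deleted_errors / n


if __name__ == "__main__":
    print(resubstitution_and_deleted([0, 0, 0, 1, 1], [1, 1, 0, 0, 1]))
```

Because a point misclassified by the full-sample vote stays in the minority after it is removed, the first returned value never exceeds the second.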



Theorem 3. Under the conditions of Lemma 1,

n+2 _ i?;) /7F72 ® N{0, 1). 

The deleted estimate for partitioning classification rule is asymptotically normal 
under similar conditions as the resubstitution estimate. 
4 Proofs

Proof of Theorem 1. The result comes from a straightforward application of
McDiarmid's inequality:

Lemma 2. (McDiarmid (1989), for the proof see Devroye, Györfi and Lugosi
(1996)) Let $Z_1, \ldots, Z_n$ be independent random variables taking values in a set
$A$, and assume that the measurable function $F : A^n \to \mathcal{R}$ satisfies

$$\sup_{z_1,\ldots,z_n,\,z_i'} |F(z_1, \ldots, z_i, \ldots, z_n) - F(z_1, \ldots, z_i', \ldots, z_n)| \le c_i, \quad 1 \le i \le n .$$

Then for all $\epsilon > 0$

$$\mathbf{P}\{|F(Z_1, \ldots, Z_n) - \mathbf{E}F(Z_1, \ldots, Z_n)| > \epsilon\} \le 2e^{-2\epsilon^2/\sum_{i=1}^n c_i^2} .$$

Let $Z_i = (X_i, Y_i)$ and let the set $A$ be $\mathcal{R}^d \times \{0, 1\}$. It can be easily seen that in case
of the resubstitution estimate for partitioning rule $c_i = \frac1n$. □

Proof of Theorem 2 

From the result on the asymptotic normality 

where is the standard normal distribution function. Since i?* — EL„ is not 
random 

-EL„|>e} = ~ EL„| > e} 

< P{^AI(|L„ - EL„| + |L„ - i?;|) > e} 

< P {^/n\Ln — EL„| > e/2} 

+P{V^|L„-i?:| >e/2| 

therefore 

+ 1 - [-^) - 

Obviously there exist such eg for which the righthand side of the inequality is 
strictly smaller than 1, and then 

for sufficiently large n. Using the trivial upper bound 

L>{e/V^) > 1 / 2 , 

eo = \/2 In 2 « 1.18 is a valid choice. 



□ 
Proof of Theorem 3

It is easy to see that if $U_n$ and $V_n$ are random variables and $U_n \to N(0, 1)$ in distribution and
$V_n \to 0$ in probability, then $U_n + V_n \to N(0, 1)$ in distribution.

So, since the conditions are the same as for Lemma 1, it suffices to show that

$$\sqrt{n}(L_n^{(D)} - L_n^{(R)}) \to 0$$

in probability as $n \to \infty$.

It is well known that if, for a nonnegative random variable $Z_n$, $\mathbf{E}Z_n \to 0$, then
$Z_n \to 0$ in probability. Thus, since from (4) $L_n^{(D)} - L_n^{(R)} \ge 0$, we have to prove that

$$\sqrt{n}\,\mathbf{E}(L_n^{(D)} - L_n^{(R)}) \to 0$$

as $n \to \infty$.

For the partitioning classification rule the deleted estimate can be written in the 
following form: 



L 



n 



m-n 

i .^ ri { Anj ) I { nv „( Anj )- l < n { nn ( A „ j )- Vn ( Anj ))} 

i=i 



Therefore 



Ln) 

{ m„ 

E (jniu^l'n {-'^nj ) j Mn {-'^nj ) {-^nj ) } 

j=l 

+ (Mn(^nj) ~ ^ni-^n.j))I{n{it.n{A„j) — Vn(Anj)) — Kni^n^Anj )}))} 

{ m-n 

E( l'n{-Anj)I{niJn(Anj)<n{fj.niA„j)-Un(Anj))} 

i=l 

+ {lJ'nAnj) — ^nAnj))I{n{fi„{Anj)-v„(A„j))<nu„{A„j)} 
ij^n )^^nVn (^nj) (^nj ) )} 

'L{fJ-n{-^nj) ^’ni^nj))I{n{it.n{A„j) — i/.n(Anj)) )}))} 



{ rrin 

{^nAnj)I{nVn{Anj)-l<n{fln(A^j)-Vn{Anj)),ni/n{Anj)>n{tJ.n{Anj)-l/n{Anj))} 

J = 1 

~l~(f^n(Anj) ~ ^n(^nj)).f{n(/r„(A„j)-i/„(A„3))-l<ni/„(A„j),n(/i„(A„j)-!/„(A„j))>ni/„(A„3)}) } 
{ rrin

E( t^n{^nj)I{niJ^{Anj)=n{fin{Anj)-iJ„(Anj)) + l} 

j=l 

+ {fJ'n{Anj) — ^n{Anj))I{n(/j.„(A„j)-i^„(A„j))=ru^„(A„j}})} 
rrin 

= AI'£e{ ^n{Anj)I{ruyn(Anj)=n(fln(Anj)-Vn(Anj)) + l} 

i=i 

+ (Mn(^ni) ~ ^n(Anj))I^n(p„(A„j)-iy„(A„j))=niy„(A„j)}} 
rrin 

= v«^(E{.„ 

{^nj)I{niyn,iAri,j)=n{fj.„{Ari,j)-iy„{A ni)) + l}} 

i=i 

+ E |(/r„(Anj) — l^n(Anj))I{n(fi„(A„j)-i^„(A„j))=m„(A„j)}}) ■ ( 5 ) 

Consider the first term of the sum for the cell A = Anj : 



E {i^n(^)d{n!y„(A)=n(/r„(A)-!y„(A)) + l} } 

n 

(^)-^{ni^n(A)— n(rtn(A) — I^n(A)) + l} 

fc =0 

nl2 

— 'y ' El {l^n(^)d{n!y„(A)=n(rin(A)-!y„(A)) + l} = 2fc + l} P{n^„(Al) = 2fc + 1} 

fc =0 

= V ^ ( M-v{a) \ ^ 

^ n lv^ + iyVM(^)/ V 7 

n j 2 

= E ^ (“70 (21"+ 1) 

7T. j 2 

= E ^ (t) (2/fc + 1) 

7T. j 2 
n/2 

< 22 '=i/(^)'=+ 1(^(.4) - v{A)f 

fc^O 

, .. n/2 . . 

= r^) E - y{A)f (^2^ (1 - 
71/ j 2

= E (;,) (VK^Km(,4) - .(,4»)“ (1 - 

S (t) (2V-(4)W4)-^(4)))‘(1 - M4))"-‘ 

= (2v'K^)(M^)-K^)) + 1 - f,{A )) ” 

- (2 TKiyRIFKl)) + 1 - m(A)) ” ( 6 ) 



The second term in can be bounded similarly: 

E{(^n(^) ~ ^n(^))d{n(/r„ (A) — i/„(A))=n;y„(A)} } 
n 

= Ee 

fc^O 

n/2 

“ ^ ^ ^ { (Mti(^) ^n(-^))-^{n(/Xri(^) — i^n(>l))— 'ZT'i^n(A)} “ 2 /c | P {?T./Xyi ( A) = 2A:} 

fc^O 
n/2 






n — 2k 



fc=0 

n/2 



= E 

fc=0 
nl2 



k f 2k 



%—2k 



v{A)\n{A)-v{A)f 

< - E 2fc2^'=K^)'=(M(^) - ^{A)f ( ^ (1 - ^M{A)) 
= ^E2fc(A (2^v{A){i,{A) - KA)))'' (1 - 

fc =0 ^ 

1 ^ /" \ t 

fc =0 ^ ^ 

= ^ (2\/j^(H)(^(H) - + 1 - fi{A)^ 



(^^v{A){^{A) — v{A)) + 1 — 

(2 x/k^)(m(a)-k^))) 

fc^O ^ 

(2\/j^(H)(^(H) - v{A)) + 1 - fi{A)j 
E < Binom n, 



‘2\/^iA){KA) - v{A)) 



2^v{A){iJ.{A) - v{A)) + 1 - /r(A) 
^l{A)



< 



{2^Jv{A){^i{A) - v{A)) + 1 - fi{A)^ 



1 - n{A) 

From this upper bound and from © and 0 

\/^E(Zy^ Lrt) 



< 



^ j ^^]j’^(^nj){fJ-(Anj) — v(Anj)) + 1 — ^i(Anj)^ 



rrin . . 

^ 2 y^(-A . ^ it^{-^‘nj) — l^{Anj )) j 



i=i 



= VT,Y, 






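where the second inequality comes from the fact that $1 - z \le e^{-z}$ if $z \ge 0$,
and clearly $\mu(A_{n,j}) - 2\sqrt{\nu(A_{n,j})(\mu(A_{n,j}) - \nu(A_{n,j}))} \ge 0$, and we can assume that
$\mu(A_{n,j}) \le 1/2$.

The condition (3) means that

$$\left|\frac12 - \frac{\nu(A_{n,j})}{\mu(A_{n,j})}\right| \ge \frac{c}{2},$$

and because of this

$$\frac{\nu(A_{n,j})}{\mu(A_{n,j})}\left(1 - \frac{\nu(A_{n,j})}{\mu(A_{n,j})}\right) \le \frac14 - \frac{c^2}{4} .$$

Thus, if we denote $1 - 2\sqrt{\frac14 - \frac{c^2}{4}} = 1 - \sqrt{1 - c^2}$ by $\delta(c)$ then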

"*n -n/r(A„j) I 1-2 

V^E(Z/^ Tyj) ^ \/n ^ ^ 2/i(Aj^j )e ^ 

i=i 

rrin 

<v«E 



1 1 




y ^ 


e(>i„3 J 



i=i 



"f-n 2 

= 77 ^^(c)’^Ai(^ni)e 

o[c)n 

j=i 



-S(c)nfi.(Anj) 



< 



^ <5(c)i 



< ^/nrur, 



■ max ze 



2mr, 



6{c)n 6{c)y/n 



because of condition ©• 
Abstract. A number of results have bounded generalization of a classifier
in terms of its margin on the training points. There has been some
debate about whether the minimum margin is the best measure of the
distribution of training set margin values with which to estimate the
generalization. Freund and Schapire have shown how a different function
of the margin distribution can be used to bound the number of mistakes
of an on-line learning algorithm for a perceptron, as well as an expected
error bound. We show that a slight generalization of their construction
can be used to give a pac style bound on the tail of the distribution of
the generalization errors that arise from a given sample size. We also
derive an algorithm for optimizing the new measure for general kernel
based learning machines. Some preliminary experiments are presented.



1 Introduction 

The idea that a large margin classifier might be expected to give good gener- 
alization is certainly not new i5in . Despite this insight it was not until com- 
paratively recently m that such a conjecture has been placed on a firm footing 
in the probably approximately correct (pac) model of learning. Learning in this 
model entails giving a bound on the generalization error which will hold with 
high confidence over randomly drawn training sets. In this sense it can be said 
to ensure robust learning, something that cannot be guaranteed by bounds on 
the expected error of a classifier. 

Despite successes in extending this style of analysis to the agnostic case P 
and applying it to neural networks Pj, boosting algorithms 0 and Bayesian 
algorithms P|, there has been concern that the measure of the distribution of 
margin values attained by the training set is largely ignored in a bound that 
depends only on its minimal value. Intuitively, there appeared to be something 
lost with a bound that depended so critically on the positions of possibly a small 
proportion of the training set, ignoring the margin attained by the majority of 
the points. 

Freund and Schapire (a similar technique was employed by Klasner and Simon
for rendering a real valued function learning algorithm noise tolerant)
developed a measure of the margin distribution which they showed could be
used to bound the expected generalization error more tightly than the minimal
margin. The aim of this paper is to show that the same measure can also be used
to provide a pac style bound on the generalization error. We will also develop an
algorithm for a modified kernel based linear machine which directly optimises
the new measure.



2 Background Results 

We first give some necessary definitions. 

Definition 1. Let $H$ be a set of binary valued functions. We say that a set of
points $X$ is shattered by $H$ if for all binary vectors $b$ indexed by $X$, there is a
function $f_b \in H$ realising $b$ on $X$. The Vapnik-Chervonenkis (VC) dimension,
$\mathrm{VCdim}(H)$, of the set $H$ is the size of the largest shattered set, if this is finite
or infinity otherwise.

Definition 2. Let $H$ be a set of real valued functions. We say that a set of
points $X$ is $\gamma$-shattered by $H$ if there are real numbers $r_x$ indexed by $x \in X$ such
that for all binary vectors $b$ indexed by $X$, there is a function $f_b \in H$ satisfying

$$f_b(x) \begin{cases} \ge r_x + \gamma & \text{if } b_x = 1 \\ \le r_x - \gamma & \text{otherwise.} \end{cases}$$

The fat shattering dimension $\mathrm{fat}_H$ of the set $H$ is a function from the positive
real numbers to the integers which maps a value $\gamma$ to the size of the largest
$\gamma$-shattered set, if this is finite or infinity otherwise.



We will make critical use of the following result contained in Shawe-Taylor et
al., which involves the fat shattering dimension of the space of functions.



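Theorem 1. Consider a real valued function class $\mathcal{H}$ having fat shattering
function bounded above by the function $\mathrm{afat} : \mathbb{R} \to \mathbb{N}$ which is continuous from the
right. Fix $\theta \in \mathbb{R}$. Then with probability at least $1 - \delta$ a learner who correctly
classifies $m$ independently generated examples $\mathbf{z}$ with $h = T_\theta(f) \in T_\theta(\mathcal{H})$ such
that $\mathrm{er}_{\mathbf{z}}(h) = 0$ and $\gamma = \min_i |f(x_i) - \theta|$ will have error of $h$ bounded from above
by

$$\epsilon(m, k, \delta) = \frac{2}{m}\left(k\log_2\left(\frac{8em}{k}\right)\log_2(32m) + \log_2\left(\frac{8m}{\delta}\right)\right),$$

where $k = \mathrm{afat}(\gamma/8) \le em$.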



Note how the fat shattering dimension at scale 7/8 plays the role of the VC 
dimension in this bound. This result motivates the use of the term effective 
VC dimension for this value. In order to make use of this theorem, we must 
have a bound on the fat shattering dimension and then calculate the margin of 
the classifier. We begin by considering bounds on the fat shattering dimension. 
The first bound on the fat shattering dimension of bounded linear functions in 
a finite dimensional space was obtained by Shawe-Taylor et al. PUj. Gurvits 0 
generalised this to infinite dimensional Banach spaces. We will quote an improved 
version of this bound for Hilbert spaces which is contained in 0 (slightly adapted 
here for an arbitrary bound on the linear operators). 
Theorem 2. Consider a Hilbert space and the class of linear functions $L$
of norm less than or equal to $B$ restricted to the sphere of radius $R$ about the
origin. Then the fat shattering dimension of $L$ can be bounded by

$$\mathrm{fat}_L(\gamma) \le \left(\frac{BR}{\gamma}\right)^2 .$$

In order to apply Theorems 1 and 2 we need to bound the radius of the sphere
containing the points and the norm of the linear functionals involved. Clearly,
scaling by these quantities will give the margin appropriate for application of
the theorem.

3 Main Result

Let $X$ be a Hilbert space. We define the following Hilbert space derived from $X$.

Definition 3. Let $L_f(X)$ be the set of real valued functions $f$ on $X$ with support
$\mathrm{supp}(f)$ finite, that is functions in $L_f(X)$ are non-zero only for finitely many
points. We define the inner product of two functions $f, g \in L_f(X)$, by

$$\langle f \cdot g \rangle = \sum_{x\in\mathrm{supp}(f)} f(x)g(x) .$$

Note that the sum which defines the inner product is well-defined since the
functions have finite support. Clearly the space is closed under addition and
multiplication by scalars.

Now for any fixed $\Delta > 0$ we define an embedding of $X$ into the Hilbert space
$X \times L_f(X)$ as follows.

$$\tau_\Delta : x \mapsto (x, \Delta\delta_x),$$

where $\delta_x \in L_f(X)$ is defined by

$$\delta_x(y) = \begin{cases} 1, & \text{if } y = x \\ 0, & \text{otherwise.} \end{cases}$$

We begin by considering the case where $\Delta$ is fixed. In practice we wish to choose
this parameter in response to the data. In order to obtain a bound over different
values of $\Delta$ it will be necessary to apply the following theorem several times.
For a linear classifier $u$ on $X$ and threshold $b \in \mathbb{R}$ we define

$$d((x, y), (u, b), \gamma) = \max\{0, \gamma - y(\langle u \cdot x \rangle - b)\} .$$

This quantity is the amount by which $u$ fails to reach the margin $\gamma$ on the point
$(x, y)$ or 0 if its margin is larger than $\gamma$. Similarly for a training set $S$, we define

$$D(S, (u, b), \gamma) = \sqrt{\sum_{(x,y)\in S} d((x, y), (u, b), \gamma)^2} .$$
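The two margin quantities used in this section, the per-point shortfall $d((x,y),(u,b),\gamma) = \max\{0, \gamma - y(\langle u \cdot x\rangle - b)\}$ and its Euclidean aggregate $D(S,(u,b),\gamma) = \sqrt{\sum_{(x,y)\in S} d((x,y),(u,b),\gamma)^2}$, are straightforward to compute. The sketch below assumes finite-dimensional inputs given as Python lists and labels in $\{-1,+1\}$; the function names are illustrative.

```python
import math


def margin_shortfall(x, y, u, b, gamma):
    """d((x, y), (u, b), gamma) = max(0, gamma - y * (<u, x> - b)), with y in {-1, +1}."""
    inner = sum(ui * xi for ui, xi in zip(u, x))
    return max(0.0, gamma - y * (inner - b))


def margin_shortfall_norm(S, u, b, gamma):
    """D(S, (u, b), gamma): Euclidean norm of the shortfalls over the training set S.

    S is an iterable of (x, y) pairs.
    """
    return math.sqrt(sum(margin_shortfall(x, y, u, b, gamma) ** 2 for x, y in S))


if __name__ == "__main__":
    S = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-0.5, 0.0], 1)]
    print(margin_shortfall_norm(S, u=[1.0, 0.0], b=0.0, gamma=1.0))
```

A point that already attains margin $\gamma$ contributes nothing, so $D$ vanishes exactly when the minimal margin on $S$ is at least $\gamma$.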
Theorem 3. Fix $\Delta > 0$, $b \in \mathbb{R}$. Consider a fixed but unknown probability
distribution on the input space $X$ with support in the ball of radius $R$ about the
origin. Then with probability $1 - \delta$ over randomly drawn training sets $S$ of size
$m$ for all $\gamma > 0$ the generalization of a linear classifier $u$ on $X$ thresholded at $b$
is bounded by

$$\epsilon(m, k, \delta) = \frac{2}{m}\left(k\log_2\left(\frac{8em}{k}\right)\log_2(32m) + \log_2\left(\frac{8m}{\delta}\right)\right),$$

where

$$k = \frac{64.5(R^2 + \Delta^2)(\|u\|^2 + D(S, (u, b), \gamma)^2/\Delta^2)}{\gamma^2},$$

provided $m \ge 2/\epsilon$ and $k \le em$.



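Proof: Consider the fixed mapping $\tau_\Delta$ and the augmented linear functional over
the space $X \times L_f(X)$,

$$\hat u = \left(u,\ \frac{1}{\Delta}\sum_{(x,y)\in S} d((x, y), (u, b), \gamma)\, y\, \delta_x\right) .$$

We claim that

1. for $x \notin S$, $\langle u \cdot x \rangle = \langle \hat u \cdot \tau_\Delta(x) \rangle$, and

2. the margin of $\hat u$ with threshold $b$ on the training set $\tau_\Delta(S)$ is $\gamma$.

Hence, the behaviour of the linear classifier $(u, b)$ can be characterised by the
behaviour of $(\hat u, b)$, while $(\hat u, b)$ is a large margin classifier in the space $X \times L_f(X)$.
Since for $x \in S$, $\|\tau_\Delta(x)\|^2 \le R^2 + \Delta^2$ and $\|\hat u\|^2 = \|u\|^2 + D(S, (u, b), \gamma)^2/\Delta^2$, the
result will then follow from an application of Theorems 1 and 2. Note that we
have replaced the constant 64 by 64.5 to ensure the continuity from the right
required by Theorem 1.

1. The first claim follows immediately from the observation that for $z \notin S$,

$$\left\langle \Big(\sum_{(x,y)\in S} d((x, y), (u, b), \gamma)\, y\, \delta_x\Big) \cdot \delta_z \right\rangle = 0 .$$

2. For $(x', y') \in S$, we have

$$y'(\langle \hat u \cdot \tau_\Delta(x') \rangle - b) = y'(\langle u \cdot x' \rangle - b) + y'\sum_{(x,y)\in S} d((x, y), (u, b), \gamma)\, y\, \langle \delta_x \cdot \delta_{x'} \rangle$$
$$\ge \gamma - d((x', y'), (u, b), \gamma) + d((x', y'), (u, b), \gamma) = \gamma .$$

The theorem follows. ■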
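The construction in the proof can be verified numerically. In the sketch below the $L_f(X)$ component is represented only on the $m$ training points (one coordinate per training point, assumed distinct), so $\tau_\Delta(x_i)$ becomes $x_i$ followed by $\Delta$ times the $i$-th unit vector and $\hat u$ becomes $u$ followed by the entries $d_i y_i/\Delta$. This finite representation, the function names and the labels in $\{-1,+1\}$ are assumptions made for illustration.

```python
def dot(a, b):
    """Plain Euclidean inner product of two equal-length sequences."""
    return sum(s * t for s, t in zip(a, b))


def augment(X, Y, u, b, gamma, delta_scale):
    """Return (embedded training points, augmented functional, shortfalls d_i).

    X: list of m distinct points, Y: labels in {-1, +1}, delta_scale: the parameter Delta > 0.
    """
    m = len(X)
    d = [max(0.0, gamma - y * (dot(u, x) - b)) for x, y in zip(X, Y)]
    # tau_Delta(x_i) = (x_i, Delta * delta_{x_i}), restricted to the training points.
    X_aug = [list(x) + [delta_scale if i == j else 0.0 for j in range(m)]
             for i, x in enumerate(X)]
    # u_hat = (u, (1 / Delta) * sum_i d_i * y_i * delta_{x_i}).
    u_aug = list(u) + [di * y / delta_scale for di, y in zip(d, Y)]
    return X_aug, u_aug, d


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.2, 0.1], [-1.0, 0.5], [0.1, -0.3]]
    Y = [1, 1, -1, -1]
    X_aug, u_aug, d = augment(X, Y, u=[1.0, 0.0], b=0.0, gamma=0.5, delta_scale=0.7)
    print([y * dot(u_aug, x) for x, y in zip(X_aug, Y)])
```

Every training point attains margin at least $\gamma$ in the augmented space, and the price is the extra term $D^2/\Delta^2$ in the squared norm of $\hat u$.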
We now apply this theorem several times to allow a choice of $\Delta$ which
approximately minimises the expression for $k$. Note that the minimum of the expression
(ignoring the constant and suppressing the denominator $\gamma^2$) is $(R+D)^2$, attained
when $\Delta = \sqrt{RD}$.
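The claim about the minimiser can be checked directly. For $\|u\| = 1$ the expression to be minimised is $(R^2 + \Delta^2)(1 + D^2/\Delta^2) = R^2 + D^2 + \Delta^2 + R^2D^2/\Delta^2$, whose last two terms are balanced at $\Delta^2 = RD$. The sketch below evaluates it at $\Delta = \sqrt{RD}$ and at half that value; the function name and the sample numbers are illustrative assumptions.

```python
import math


def k_expression(R, D, delta_scale):
    """(R^2 + Delta^2) * (1 + D^2 / Delta^2): the Delta-dependent factor of k for ||u|| = 1."""
    return (R ** 2 + delta_scale ** 2) * (1.0 + D ** 2 / delta_scale ** 2)


if __name__ == "__main__":
    R, D = 2.0, 0.5
    best = math.sqrt(R * D)
    print(k_expression(R, D, best), (R + D) ** 2)
    print(k_expression(R, D, best / 2), (R + D) ** 2 + 2.25 * R * D)
```

Halving the optimal $\Delta$ costs only the additive term $2.25RD$, which is why a geometric grid of $\Delta$ values is enough.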



Theorem 4. Fix $b \in \mathbb{R}$. Consider a fixed but unknown probability distribution
on the input space $X$ with support in the ball of radius $R$ about the origin. Then
with probability $1 - \delta$ over randomly drawn training sets $S$ of size $m$ for all
$\gamma > 0$ such that $d((x, y), (u, b), \gamma) = 0$, for some $(x, y) \in S$, the generalization of
a linear classifier $u$ on $X$ satisfying $\|u\| \le 1$ is bounded by

$$\epsilon(m, k, \delta) = \frac{2}{m}\left(k\log_2\left(\frac{8em}{k}\right)\log_2(32m) + \log_2\left(\frac{2m(28 + \log_2(m))}{\delta}\right)\right),$$

where

$$k = \frac{65[(R + D)^2 + 2.25RD]}{\gamma^2}$$

for $D = D(S, (u, b), \gamma)$, and provided $m \ge \max\{2/\epsilon, 6\}$ and $k \le em$.
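To see what the bound of Theorem 4 looks like numerically, the sketch below evaluates $k = 65[(R+D)^2 + 2.25RD]/\gamma^2$ and $\epsilon(m,k,\delta) = \frac{2}{m}\big(k\log_2(8em/k)\log_2(32m) + \log_2(2m(28+\log_2 m)/\delta)\big)$. Both formulas are read from a partly illegible statement, so their exact form is an assumption; the function names and the sample values are illustrative, and $k$ is used without rounding.

```python
import math


def k_theorem4(R, D, gamma):
    """k = 65 * ((R + D)^2 + 2.25 * R * D) / gamma^2 (not rounded)."""
    return 65.0 * ((R + D) ** 2 + 2.25 * R * D) / gamma ** 2


def epsilon_theorem4(m, k, delta):
    """epsilon(m, k, delta) as assumed above; requires m >= 6 and 0 < k <= e * m."""
    if m < 6 or not 0.0 < k <= math.e * m:
        raise ValueError("need m >= 6 and 0 < k <= e * m")
    complexity = k * math.log2(8.0 * math.e * m / k) * math.log2(32.0 * m)
    confidence = math.log2(2.0 * m * (28.0 + math.log2(m)) / delta)
    return (2.0 / m) * (complexity + confidence)


if __name__ == "__main__":
    m, delta = 10 ** 6, 0.05
    for D in (0.0, 0.5, 2.0):
        k = k_theorem4(R=1.0, D=D, gamma=0.5)
        print(D, round(k, 1), round(epsilon_theorem4(m, k, delta), 4))
```

With these constants the bound becomes non-trivial only for large samples; the interesting feature is how it degrades smoothly as $D$ grows instead of collapsing as soon as the minimal margin is violated.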



Proof: Consider a fixed set of values for $\Delta$: $\Delta_1 = R(2m^{0.25} - 1)$, $\Delta_i = \Delta_{i-1}/2$ for $i = 2, \dots, t$, where $t$ satisfies $R/32 \ge \Delta_t > R/64$. Hence, $t \le \log_2(128\,m^{0.25}) = 7 + 0.25\log_2(m)$. We apply the preceding theorem for each of these values of $\Delta$, using $\delta' = \delta/t$ in each application. For a given value of $\gamma$ and $D = D(S, (\mathbf{u}, b), \gamma)$, it is easy to check that the value of $k$ is minimal for $\Delta = \sqrt{RD}$ and is monotonically decreasing for smaller values of $\Delta$ and monotonically increasing for larger values. Note that $\sqrt{RD} \le R\sqrt{2\sqrt{m-1}}$, as the largest absolute difference in the values of the linear function on two training points is $2R$ and since $d((\mathbf{x}, y), (\mathbf{u}, b), \gamma) = 0$ for some $(\mathbf{x}, y) \in S$, we must have $d((\mathbf{x}', y'), (\mathbf{u}, b), \gamma) \le 2R$ for all $(\mathbf{x}', y') \in S$. Hence, as $2m^{0.25} - 1 \ge \sqrt{2}\,(m-1)^{0.25}$ for $m \ge 6$, we can find a value of $\Delta_i$ satisfying $\sqrt{RD}/2 \le \Delta_i \le \sqrt{RD}$, provided $\sqrt{RD} \ge R/32$. The value of the expression

$$(R^2 + \Delta^2)(1 + D(S, (\mathbf{u}, b), \gamma)^2/\Delta^2)$$

at the value $\Delta_i$ will be upper bounded by its value at $\Delta = \sqrt{RD}/2$. A routine calculation confirms that for this value of $\Delta$, the expression is equal to $(R + D)^2 + 2.25RD$. Now suppose $\sqrt{RD} < R/32$. In this case we will show that

$$(R^2 + \Delta_t^2)(1 + D^2/\Delta_t^2) \le \frac{65}{64.5}\left\{(R + D)^2 + 2.25RD\right\},$$

so that the application of the preceding theorem with $\Delta = \Delta_t$ covers this case once the constant 64.5 is replaced by 65. Recall that $R/32 \ge \Delta_t > R/64$ and note that $\sqrt{D/R} < 1/32$. We therefore have

$$(R^2 + \Delta_t^2)(1 + D^2/\Delta_t^2) \le R^2\left(1 + \frac{1}{32^2}\right)\left(1 + \frac{64^2 D^2}{R^2}\right) < R^2\left(1 + \frac{1}{1024}\right)\left(1 + \frac{64^2}{1024^2}\right) = R^2\left(1 + \frac{1}{1024}\right)\left(1 + \frac{1}{256}\right),$$

and the product of the last two factors is less than $65/64.5$, as required. The result follows. ■
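
The two numerical facts used in the proof are easy to confirm. The sketch below is an illustrative check under our reading of the expression $(R^2+\Delta^2)(1+D^2/\Delta^2)$; the helper name `k_factor` and the sample values of $R$ and $D$ are hypothetical.

```python
import math


def k_factor(R, D, Delta):
    """The Delta-dependent factor (R^2 + Delta^2)(1 + D^2 / Delta^2)."""
    return (R ** 2 + Delta ** 2) * (1.0 + D ** 2 / Delta ** 2)


R, D = 1.0, 0.3

# At the optimum Delta = sqrt(RD) the factor equals (R + D)^2.
at_optimum = k_factor(R, D, math.sqrt(R * D))

# At Delta = sqrt(RD)/2 it equals (R + D)^2 + 2.25 RD (the "routine calculation").
at_half = k_factor(R, D, math.sqrt(R * D) / 2.0)
closed_form = (R + D) ** 2 + 2.25 * R * D

# Small-D case: the product of the two correction factors stays below 65/64.5.
small_case_factor = (1.0 + 1.0 / 1024.0) * (1.0 + 1.0 / 256.0)
```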



4 Algorithmics 

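The bound derived above suggests a different learning goal from the maximal margin hyperplane sought by the Support Vector Machine. We should instead seek to minimise $D(S, (\mathbf{u}, b), \gamma)$ for a given fixed value of $\gamma$ and subsequently minimise over different choices of $\gamma$. Vapnik has posed this problem in a slightly more general form (Section 5.5.1 of the cited reference) as follows.

For non-negative variables $\xi_i \ge 0$, we minimise the function

$$F_\sigma(\xi) = \sum_{i=1}^{m} \xi_i^{\sigma}$$

subject to the constraints:

$$y_i[(\mathbf{u} \cdot \mathbf{x}_i) - b] \ge 1 - \xi_i, \quad i = 1, \dots, m, \qquad (1)$$

$$(\mathbf{u} \cdot \mathbf{u}) \le C. \qquad (2)$$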



Note that throughout this section we will use a standard inner product but the same analysis and algorithm will apply if we use a kernel based inner product. Vapnik is most interested in values of $\sigma$ close to 0, when $F$ approximates the number of training set errors. If, however, we take $\sigma = 2$ and make the constraint (2) an equality constraint, the problem corresponds exactly to minimising $D(S, (\mathbf{u}, b), \gamma)$, where $\gamma = 1/\sqrt{C}$. This follows from considering the hyperplane $(\mathbf{u}', b') = (\mathbf{u}/\sqrt{C}, b/\sqrt{C})$, which has norm 1 and classifies the point $(\mathbf{x}_j, y_j)$ such that $d((\mathbf{x}_j, y_j), (\mathbf{u}', b'), \gamma) = \xi_j/\sqrt{C}$, so that $D(S, (\mathbf{u}', b'), \gamma) = \sqrt{\sum_j \xi_j^2 / C}$. We now consider converting to the dual problem by introducing Lagrange multipliers $\alpha_0$ for constraint (2) and $\alpha_j \ge 0$, $j = 1, \dots, m$, for constraints (1). Setting the derivatives to zero and solving for $\mathbf{u}$ gives

$$\mathbf{u} = \frac{1}{2\alpha_0} \sum_{j=1}^{m} \alpha_j y_j \mathbf{x}_j.$$



Substituting into the other expressions and simplifying results in the following Lagrangian,

$$F(\alpha_0, \alpha) = -\frac{1}{4}\sum_{j=1}^{m} \alpha_j^2 + \sum_{j=1}^{m} \alpha_j - \frac{1}{4\alpha_0}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) - \alpha_0 C,$$

which must be maximised subject to the constraints $\alpha_j \ge 0$, $j = 0, \dots, m$, and $\sum_{j=1}^{m} \alpha_j y_j = 0$. It is convenient to use vector notation, with $\alpha$ denoting the vector of $\alpha_j$, $j = 1, \dots, m$, $G$ the matrix with entries $G_{ij} = y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$, and $\mathbf{1}$ the $m$-vector with entries equal to 1. Using this notation we can write

$$F(\alpha_0, \alpha) = -\frac{1}{4}\alpha^T \alpha + \mathbf{1}^T \alpha - \frac{1}{4\alpha_0}\alpha^T G \alpha - \alpha_0 C.$$



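We can optimise with respect to $\alpha_0$ by computing $\partial F/\partial \alpha_0$ and setting it equal to zero:

$$\frac{\partial F(\alpha_0, \alpha)}{\partial \alpha_0} = \frac{1}{4\alpha_0^2}\alpha^T G \alpha - C = 0.$$

Hence, $\alpha_0 = \frac{1}{2}\sqrt{\alpha^T G \alpha / C}$, and resubstituting gives

$$F(\alpha) = F(\alpha_0, \alpha) = -\frac{1}{4}\alpha^T \alpha + \mathbf{1}^T \alpha - \sqrt{C\,\alpha^T G \alpha}, \qquad (3)$$

$$\mathbf{u} = \sqrt{\frac{C}{\alpha^T G \alpha}} \sum_{j=1}^{m} \alpha_j y_j \mathbf{x}_j. \qquad (4)$$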



Note that we can ignore the constant factor in the formula for $\mathbf{u}$ as this will not affect the classification, and in fact $\alpha^T G \alpha = \|\mathbf{u}\|^2 = C$ once the optimal value has been found. The value of $b$ can also be determined from the values of $\alpha$. We wish to confirm that this optimisation problem is concave. We can evaluate the Hessian $H(F)$ of the function $F$ as follows:

$$\operatorname{grad}(F) = -\frac{1}{2}\alpha + \mathbf{1} - \frac{\sqrt{C}\,G\alpha}{\sqrt{\alpha^T G \alpha}}.$$

$$\text{Hence } H(F) = -\frac{1}{2}I - \frac{\sqrt{C}\,[(\alpha^T G \alpha)G - G\alpha\alpha^T G]}{(\alpha^T G \alpha)^{3/2}}.$$

We wish to verify that $F$ is concave, that is $x^T H(F) x \le 0$ for all $x$:

$$x^T H(F) x = -0.5\|x\|^2 - C'\left[\|\alpha\|_G^2 \|x\|_G^2 - (x \cdot \alpha)_G^2\right],$$

where $C'$ is a positive constant and $(\cdot \, \cdot)_G$ and $\|\cdot\|_G$ are the inner product and norm defined by the semi-definite matrix $G$. By the Cauchy-Schwarz inequality the expression in square brackets is non-negative, making the overall expression negative as required. Hence, the optimal solution can be found in polynomial time by applying a gradient based central path algorithm following $\operatorname{grad}(F)$ with an appropriate learning rate $\eta$.

Note further that a small change in $\gamma > 0$ only changes the value of $D(S, (\mathbf{u}, b), \gamma)$ by a small amount for a fixed $(\mathbf{u}, b)$. Hence, the optimal value of $k$ can also only change by a small amount. Hence, solving the problem for a fine enough grid of values of $\gamma$ and choosing the value which minimises $k$ will give a value which will be within an arbitrarily small margin of the overall optimum.
Finally, note that the optimisation described above can be performed using a kernel inner product in place of the input space inner product, the technique that is used in the Support Vector Machine. Indeed even if the kernel function is not positive definite, the problem remains concave provided its eigenvalues are not too negative and $C$ is not chosen too large.

5 Experiments 

A preliminary experiment was performed with the algorithm described in the previous section. The Boston housing data [12] was chosen for the experiments. Since this data is not linearly separable, we selected a subset of the data reaching a subnode of a decision tree produced by OC1. There were 225 examples (with 13 features each) while the target function was taken as the decision boundary generated by OC1. Hence, the data was guaranteed to be linearly separable. We selected just 15 examples for training. The standard maximal margin algorithm generated a margin of $\gamma = 0.0284$ once the data had been normalised in each coordinate. Hence, the maximal possible value of the parameter $C$ would be $1239 = 1/\gamma^2$. The maximal margin hyperplane had an error rate of 0.2619 on the remaining 210 examples.

The algorithm described above was run for a range of values of the parameter $C$ and the training error, test error, and value of the indicator (the quantity in $R$, $D$ and $\gamma$ that determines $k$ in the bound) were computed. The algorithm was implemented in MATLAB using a gradient based approach. Convergence was fast, but this was to be expected with such a small sample size. Larger sample sizes have not been tested. The values of $\alpha$ computed for one value of $C$ were taken as initial values of the iterations for the next value of $C$. Table 1 gives the resulting values obtained.



       C    Training error    Test error    Indicator
      50        0.0667           0.119         25.70
      75        0.0667           0.095         31.79
     100        0                0.062         37.78
     125        0                0.071         43.65
     150        0                0.071         49.39
     175        0                0.071         55.03
     200        0                0.086         60.59
     225        0                0.110         66.10
     250        0                0.114         71.58
     275        0                0.114         77.01
     300        0                0.129         82.40
     325        0                0.124         87.76
     350        0                0.138         93.08
     375        0                0.152         98.38
     400        0                0.162        103.66
    1239        0                0.262        274.98

Table 1. Values of the error and indicator for different C



The test error goes through a clear minimum of 6.2% at $C = 100$ (see Figure 1). This is extremely impressive when we consider that the maximal margin hyperplane has a test error of 26.2% (see the last entry of the table). Even for values of $C$ where the training error is non-zero the test error is still significantly better than that achieved by the maximal margin hyperplane. This suggests that the algorithmic strategy proposed is worthy of further investigation.
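
The observations above can be read directly off Table 1. The short sketch below simply transcribes the table (the variable names are ours) and locates the minimum of the test error; it adds no new measurements.

```python
# Values transcribed from Table 1; the last entry (C = 1239) is the
# maximal margin hyperplane.
C_values = [50, 75, 100, 125, 150, 175, 200, 225,
            250, 275, 300, 325, 350, 375, 400, 1239]
test_error = [0.119, 0.095, 0.062, 0.071, 0.071, 0.071, 0.086, 0.110,
              0.114, 0.114, 0.129, 0.124, 0.138, 0.152, 0.162, 0.262]
indicator = [25.70, 31.79, 37.78, 43.65, 49.39, 55.03, 60.59, 66.10,
             71.58, 77.01, 82.40, 87.76, 93.08, 98.38, 103.66, 274.98]

# C with the smallest test error, and the error of the maximal margin solution.
best_error, best_C = min(zip(test_error, C_values))
max_margin_error = test_error[-1]

# The indicator increases monotonically with C, so it has no interior minimum.
indicator_is_increasing = all(a < b for a, b in zip(indicator, indicator[1:]))
```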
Fig. 1. Test error, training error and indicator for different values of C



The indicator results are less satisfactory as the indicator does not go through a minimum at or near the best value of $C$. In checking the values involved in the calculation it was verified that the ratio $D^2/\gamma^2$ does reduce with increasing $C$, but the value of $R$ causes the overall expression to increase. Initially it was vainly hoped that by optimising $R$ to be the radius of the minimum ball containing the data (the indicator values shown here were computed with this value for $R$) a minimum would occur. We conjecture that the effect results from the degenerate
eigenvalues of the inner product matrix considered. Recent results im suggest 
that this will significantly reduce the “effective” VC dimension and hence the 
true R should be replaced by a smaller “effective” R, when the inner product 
matrix has a large proportion of small eigenvalues. 

6 Conclusion 

We have shown how an approach developed by Freund and Schapire [3] for mistake bounded learning can be adapted to give PAC style bounds which depend on the margin distribution rather than the margin of the closest point to the hyperplane. The bounds obtained can be significantly better than previously obtained bounds, particularly when some of the points are misclassified: a classical analysis would then have to fall back on agnostic bounds, in which the square root of the sample size replaces the sample size in the denominator. The bound is also more robust than that derived for the maximal margin hyperplane, where a single point can have a dramatic effect on the hyperplane produced.
We have gone on to show how the measure of the margin distribution that appears in the bound can be optimised by expressing the optimisation problem as
a concave dual problem. This formulation also allows the problem to be solved 
in kernel spaces such as those used with the Support Vector Machine. Prelim- 
inary experiments provide evidence that the approach may improve practical 
classification performance on real world data. 

We believe that this paper presents the first pac style bound for a margin dis- 
tribution measure that is neither critically dependent on the nearest points to 
the hyperplane nor is an agnostic version of that approach. In addition, we be- 
lieve it is the first paper to give a provably optimal algorithm for optimizing the 
generalization performance of agnostic learning with hyperplanes, by showing 
that the criterion to be minimised should not be the number of training errors, 
but rather a more flexible criterion which could be termed a ‘soft margin’. The 
problem of finding a more informative and theoretically well-founded measure 
of the margin distribution has been an open problem for some time. This paper 
suggests one candidate for such a measure. 
Abstract. It is known that the covering numbers of a function class
on a double sample (length 2m) can be used to bound the generaliza- 
tion performance of a classifier by using a margin based analysis. In this 
paper we show that one can utilize an analogous argument in terms of 
the observed covering numbers on a single m-sample (being the actual 
observed data points). The significance of this is that for certain inter- 
esting classes of functions, such as support vector machines, there are 
new techniques which allow one to find good estimates for such covering 
numbers in terms of the speed of decay of the eigenvalues of a Gram 
matrix. These covering numbers can be much less than a priori bounds 
indicate in situations where the particular data received is “easy”. The 
work can be considered an extension of previous results which provided 
generalization performance bounds in terms of the VC-dimension of the 
class of hypotheses restricted to the sample, with the considerable ad- 
vantage that the covering numbers can be readily computed, and they 
often are small. 



1 Introduction 

The PAC framework (sometimes known as the Statistical Learning framework) 
for analysing the generalization of a learning system bases its analysis on the 
complexity of the class of hypotheses that can be output by the learning algo- 
rithm. Typically this leads to poor estimates of generalization as the class must 
be chosen large enough to solve a wide range of possible tasks. Structural Risk 
Minimisation counters this problem by placing an a priori hierarchy on the class 
of functions and allowing the learner to seek a function starting in the simpler 

* This work was supported by the Australian Research Council and the European 
Commission under the Working Group Nr. 27150 (NeuroCOLT2). 
classes. If a satisfactory function is found in a simple class the corresponding
bound on the generalization error is that much tighter. In this sense the es- 
timate is obtained a posteriori based on the class determined by the training 
algorithm. 

Only recently have techniques for bounding the tails of the distribution of a 
data-dependent estimator been proposed [tilTIbj . Initially Shawe-Taylor et al. jZj 
showed that the maximal margin hyperplane algorithm used for the support
vector machine of Cortes and Vapnik [3] can be analysed in this way using the
size of the margin as the predictor of generalization. This should be distinguished
from classical Structural Risk Minimisation since the assignment of hypothesis
to complexity class depends on the data and also the target function. The large
margin approach has been extended to general neural networks by Bartlett [2].
The line taken in this paper is based on a more general framework developed 
in 0 which allows inference of good generalization from different measures of 
performance other than the margin of the classifier. 

Our main result is Theorem 1, which bounds the generalization error in terms of
the covering numbers observed on the training set at a scale determined by the 
margin of the classifier. Roughly speaking, the role of the VC dimension in the 
traditional bound on classifier generalization performance is taken by the log of 
the covering number of the class when restricted to the observed data sample. 
The scale at which the covering number is measured depends on the observed 
margin. 

The idea of bounding generalization in terms of the VC dimension measured on 
the training sample was considered in The problem with the result there is 
that there is no simple way of estimating the VC dimension of a set of hypothe- 
ses. The approach would also not apply to bounding the generalization of large 
margin classifiers, since in that case the role of the VC dimension is played by 
the fat-shattering dimension at a scale dictated by the size of the margin. The 
present paper is motivated by the recently discovered m fact that empirical 
covering numbers can be readily determined for interesting classes of machines, 
such as SV machines. We will give one way of using those results as an example 
towards the end of the paper, though this will not represent the optimal use of 
the bound in general. 



2 Background Results 

We will assume that a fixed number $m$ of labelled examples are given as a vector $\mathbf{z} = (\mathbf{x}, t(\mathbf{x}))$ to the learner, where $\mathbf{x} = (x_1, \dots, x_m)$ and $t(\mathbf{x}) = (t(x_1), \dots, t(x_m))$. We use $\mathrm{Er}_{\mathbf{z}}(h) = |\{i : h(x_i) \neq t(x_i)\}|$ to denote the number of errors that $h$ makes on $\mathbf{z}$, and $\mathrm{er}_P(h) = P\{x : h(x) \neq t(x)\}$ to denote the expected error when $x$ is drawn according to $P$. In what follows we will often write $\mathrm{Er}_{\mathbf{x}}(h)$ (rather than $\mathrm{Er}_{\mathbf{z}}(h)$) when the target $t$ is obvious from the context. If $\mathbf{x}, \mathbf{y} \in X^m$, we denote by $\mathbf{xy}$ their concatenation $(x_1, \dots, x_m, y_1, \dots, y_m)$.
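
The notation can be made concrete with a few lines of code. This is only an illustrative sketch: the name `empirical_errors` and the toy hypothesis and target are hypothetical.

```python
def empirical_errors(h, t, sample):
    """Er(h): the number of points of the sample on which h disagrees with t."""
    return sum(1 for x in sample if h(x) != t(x))


# Toy hypothesis and target on the real line (hypothetical choices).
h = lambda x: x > 0
t = lambda x: x >= 1

x_sample = [-1.0, 0.5, 2.0]
y_sample = [0.2, 3.0, -4.0]
xy_sample = x_sample + y_sample  # the concatenation "xy" of the double sample

errors_on_x = empirical_errors(h, t, x_sample)
errors_on_xy = empirical_errors(h, t, xy_sample)
```

The error count on the double sample is the sum of the counts on its two halves, which is the property the symmetrisation arguments below rely on.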

The key concept introduced in Shawe-Taylor et al. is ‘luckiness’. The main 
idea is to fix in advance some simplifying assumption about the target function 
and distribution, and encode this assumption in a real-valued function defined
on the space of training samples and hypotheses. The value of the function 
indicates the extent to which the assumption has been upheld for that sample 
and hypothesis. 

We will not use this theory directly but will follow the spirit of the luckiness 
approach in bounding the probability that the covering numbers on the first half 
of the sample differ significantly from those on the double sample. This result 
can then be used to bound the generalization of a classifier with an observed 
margin. 

We give the definition of the fat-shattering dimension, which was first introduced 
in 0, and has been used for several problems in learning since 

Definition 1. Let $\mathcal{F}$ be a set of real valued functions. We say that a set of points $X$ is $\gamma$-shattered by $\mathcal{F}$ relative to $r = (r_x)_{x \in X}$ if there are real numbers $r_x$ indexed by $x \in X$ such that for all binary vectors $b$ indexed by $X$, there is a function $f_b \in \mathcal{F}$ satisfying

$$f_b(x) \begin{cases} > r_x + \gamma & \text{if } b_x = 1, \\ \le r_x - \gamma & \text{otherwise.} \end{cases}$$

The fat-shattering dimension $\mathrm{fat}_{\mathcal{F}}$ of the set $\mathcal{F}$ is a function from the positive real numbers to the integers which maps a value $\gamma$ to the size of the largest $\gamma$-shattered set, if this is finite, or infinity otherwise.

Note that in our definition of the fat-shattering dimension we have used a slightly unconventional strict inequality for the value on a positive example. This will prove useful in the technical detail, but also ensures that the definition reduces to the Pollard dimension for $\gamma = 0$.

We begin with a technical lemma which analyses the probabilities under the swapping group of permutations used in the symmetrisation argument. The group $S$ consists of all $2^m$ permutations which exchange corresponding points in the first and second halves of the sample, i.e. $x_j \leftrightarrow y_j$ for $j$ in some subset of $\{1, \dots, m\}$.

Lemma 1. Let $S$ be the swapping group of permutations on a $2m$ sample of points $\mathbf{xy}$. Consider any fixed set $z_1, \dots, z_d$ of the points. For $3k < d$ the probability $P_{d,k}$ under the uniform distribution over permutations that exactly $k$ of the points $z_1, \dots, z_d$ are in the first half of the sample is bounded by

$$P_{d,k} \le \binom{d}{k} 2^{-d}.$$

Before we can quote the next lemma, we need another definition. 

Definition 2. Let $(X, d)$ be a (pseudo-)metric space, let $A$ be a subset of $X$ and $\epsilon > 0$. A set $B \subseteq X$ is an $\epsilon$-cover for $A$ if, for every $a \in A$, there exists $b \in B$ such that $d(a, b) \le \epsilon$. The $\epsilon$-covering number of $A$, $\mathcal{N}_d(\epsilon, A)$, is the minimal cardinality of an $\epsilon$-cover for $A$ (if there is no such finite cover then it is defined to be $\infty$). We will say the cover is proper if $B \subseteq A$.
We have used a somewhat unconventional less than or equal to in the definition
of a cover, as this will prove technically useful in the proofs. We next define the 
covering numbers that we are concerned with. 

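Definition 3. Let $\mathcal{F}$ be a class of real-valued functions on the space $X$. For any $m \in \mathbb{N}$ and $\mathbf{x} \in X^m$, we define the pseudo-metric

$$d_{\mathbf{x}}(f, g) = \max_{1 \le i \le m} |f(x_i) - g(x_i)|.$$

This is referred to as the $l^\infty$ distance over a finite sample $\mathbf{x} = (x_1, \dots, x_m)$. We write $\mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x}) = \mathcal{N}_{d_{\mathbf{x}}}(\epsilon, \mathcal{F})$. Note that the cover is not required to be proper.

Observe that $\mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x}) = \mathcal{N}_{l^\infty}(\epsilon, \mathcal{F}|_{\mathbf{x}})$, the $l^\infty$ covering number of

$$\mathcal{F}|_{\mathbf{x}} := \{(f(x_1), \dots, f(x_m)) : f \in \mathcal{F}\},$$

the class $\mathcal{F}$ restricted to the sample $\mathbf{x}$.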
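
For a finite set of functions, an upper bound on the empirical covering number can be computed directly from the output vectors on the sample. The sketch below uses a simple greedy construction, which is our own illustrative choice and not a method from the paper: it returns a proper cover, whose size is at least the (possibly improper) covering number $\mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x})$. The names and the toy function class are hypothetical.

```python
def linf_distance(u, v):
    """The l-infinity distance between two output vectors on the sample."""
    return max(abs(a - b) for a, b in zip(u, v))


def greedy_cover(vectors, eps):
    """Indices of a proper eps-cover (distance <= eps) of the given vectors.

    The size of the returned cover upper-bounds the covering number; the
    greedy choice is not guaranteed to be minimal.
    """
    centres = []
    for i, v in enumerate(vectors):
        if not any(linf_distance(v, vectors[c]) <= eps for c in centres):
            centres.append(i)
    return centres


# Toy class: f_w(x) = w * x for w on a grid, restricted to a 3-point sample.
sample = [-1.0, 0.5, 2.0]
weights = [i / 10.0 for i in range(-10, 11)]
restricted_class = [[w * x for x in sample] for w in weights]

cover = greedy_cover(restricted_class, eps=0.5)
cover_size = len(cover)
```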



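We now quote a lemma from [7] which follows directly from a result of Alon et al.

Corollary 1. Let $\mathcal{F}$ be a class of functions $X \to [a, b]$ and $P$ a distribution over $X$. Choose $0 < \epsilon < 1$ and let $d = \mathrm{fat}_{\mathcal{F}}(\epsilon/4)$. Then

$$\sup_{\mathbf{x} \in X^m} \mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x}) \le 2\left(\frac{4m(b - a)^2}{\epsilon^2}\right)^{d \log(2em(b - a)/(d\epsilon))}.$$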

Let $\pi_\gamma(\alpha)$ be the identity function in the range $[\theta - 2.01\gamma, \theta]$, with output $\theta$ for larger values and $\theta - 2.01\gamma$ for smaller ones, and let $\pi_\gamma(\mathcal{F}) = \{\pi_\gamma(f) : f \in \mathcal{F}\}$.

The choice of the threshold $\theta$ is arbitrary but will be fixed before any analysis is made.
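
The clipping map is a one-liner in code. A minimal sketch, with the hypothetical name `clip_to_margin_band`:

```python
def clip_to_margin_band(value, theta, gamma):
    """pi_gamma: identity on [theta - 2.01*gamma, theta], clipped outside it."""
    lower = theta - 2.01 * gamma
    return min(max(value, lower), theta)


theta, gamma = 1.0, 1.0
inside = clip_to_margin_band(0.5, theta, gamma)   # unchanged
above = clip_to_margin_band(5.0, theta, gamma)    # clipped down to theta
below = clip_to_margin_band(-5.0, theta, gamma)   # clipped up to theta - 2.01*gamma
```

Clipping never increases the distance between two output values, which is why covers of the clipped class are no larger than covers of the original class.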

We will need some compactness properties of the class of functions which will 
hold in all cases usually considered. We formalise the requirement in the following 
definition. 



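Definition 4. Let

$$\tilde{\mathbf{x}} : \mathcal{F} \to \mathbb{R}^m, \qquad \tilde{\mathbf{x}} : f \mapsto (f(x_1), f(x_2), \dots, f(x_m))$$

denote the multiple evaluation map induced by $\mathbf{x} = (x_1, \dots, x_m) \in X^m$. We say that a class of functions $\mathcal{F}$ is sturdy if for all $m \in \mathbb{N}$ and all $\mathbf{x} \in X^m$ the image $\tilde{\mathbf{x}}(\mathcal{F})$ of $\mathcal{F}$ under $\tilde{\mathbf{x}}$ is a compact subset of $\mathbb{R}^m$.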



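Lemma 2. Let $\mathcal{F}$ be a sturdy class of functions. Then for each $N \in \mathbb{N}$ and any fixed sequence $\mathbf{x} \in X^m$, the infimum $\gamma_N = \inf\{\gamma : \mathcal{N}(\gamma, \mathcal{F}, \mathbf{x}) = N\}$ is attained.

Proof: We first show that for any fixed sequence of functions $\mathbf{f} = (f_1, \dots, f_N)$ the infimum

$$\gamma_{\mathbf{f}} = \inf\Bigl\{\gamma : \bigcup_j B_\gamma(f_j) = \mathcal{F}\Bigr\}$$

is attained, where $B_\gamma(f_j) = \{f : d_{\mathbf{x}}(f, f_j) \le \gamma\}$ is the closed ball centred at $f_j$ of radius $\gamma$ in the $d_{\mathbf{x}}$ metric. For any $m \in \mathbb{N}$, the multiple evaluation map

$$\tilde{\mathbf{x}} : \mathcal{F} \to \mathbb{R}^m, \qquad \tilde{\mathbf{x}} : f \mapsto (f(x_1), \dots, f(x_m))$$

has as its image a closed compact subset of $\mathbb{R}^m$ by the definition of sturdiness. The function $f_j$ maps to a point in this subset and the functions in $B_\gamma(f_j)$ are precisely those functions whose image lies in the rectangle with sides $2\gamma$ centred at $\tilde{\mathbf{x}}(f_j)$. If we create the Voronoi diagram $V = \bigcup_{j=1}^{N} V_j$ about the points $\tilde{\mathbf{x}}(f_j)$, $j = 1, \dots, N$, relative to the $l^\infty$ metric in $\mathbb{R}^m$ then $\tilde{\mathbf{x}}(\mathcal{F}) \cap V_j$ is closed and compact for $j = 1, \dots, N$. Thus there is a point $z_j \in \tilde{\mathbf{x}}(\mathcal{F}) \cap V_j$ which is a maximum distance from $\tilde{\mathbf{x}}(f_j)$ ($j = 1, \dots, N$). We can thus define

$$\gamma_{f_j} = \max_{z \in V_j \cap \tilde{\mathbf{x}}(\mathcal{F})} \|z - \tilde{\mathbf{x}}(f_j)\|_{l^\infty}.$$

With $\gamma_{\mathbf{f}} = \max_j\{\gamma_{f_j}\}$ we have $\bigcup_j B_{\gamma_{\mathbf{f}}}(f_j) = \mathcal{F}$ as required. To complete the proof observe that the mapping

$$\gamma : \tilde{\mathbf{x}}(\mathcal{F}) \subset \mathbb{R}^m \to \mathbb{R}, \qquad \tilde{\mathbf{x}}(f_j) \mapsto \gamma_{f_j},$$

is continuous. Hence the mapping

$$\gamma^N : (\tilde{\mathbf{x}}(\mathcal{F}))^N \subset \mathbb{R}^{mN} \to \mathbb{R}, \qquad \gamma^N : (\tilde{\mathbf{x}}(f_1), \dots, \tilde{\mathbf{x}}(f_N)) \mapsto \gamma_{\mathbf{f}} = \max_j \gamma(\tilde{\mathbf{x}}(f_j))$$

is too. Since $\tilde{\mathbf{x}}(\mathcal{F})$ is compact, $(\tilde{\mathbf{x}}(\mathcal{F}))^N$ is too and so $\gamma^N$ attains its minimum $\gamma_N$. ■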

We will use the following lemma, which in the form below is given by Vapnik 0 
page 168]. 

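Lemma 3. Let $X$ be a set and $S$ a system of sets on $X$, and $P$ a probability measure on $X$. For $\mathbf{x} \in X^m$, $\mathbf{y} \in X^m$, and $A \in S$, define $\nu_{\mathbf{x}}(A) := |\mathbf{x} \cap A|/m$. If $m > 2/\epsilon$, then

$$P^m\Bigl\{\mathbf{x} : \sup_{A \in S} |\nu_{\mathbf{x}}(A) - P(A)| > \epsilon\Bigr\} \le 2P^{2m}\Bigl\{\mathbf{xy} : \sup_{A \in S} |\nu_{\mathbf{x}}(A) - \nu_{\mathbf{y}}(A)| > \epsilon/2\Bigr\}.$$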

3 Covering Numbers on a Double Sample 

We begin by presenting a key proposition that shows with high probability the 
covering numbers on a sample provide a good estimate of the covering num- 
bers on a double sample. Although the result contains no reference to the fat- 
shattering dimension, it does play a key role in the proof. It is the combinatorial 
properties of the fat-shattering dimension which make it possible to infer the 
properties of the second half of the sample from the first. The probabilistic in- 
ference of the fat-shattering dimension on the double sample in terms of its 
value on the first half involves a multiplicative factor slightly larger than three. 
Its precise form is given in the following definition. 



Classifiers in Terms of Observed Covering Numbers 






Definition 5. For $U \in \mathbb{N}$ and $\delta \in \mathbb{R}^+$, we define the function

$$\alpha(U, \delta) = 3.08\left(1 + \frac{1}{U}\ln\frac{1}{\delta}\right).$$


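Proposition 1. For fixed $U \in \mathbb{N}$ we have for all $\epsilon > 0$, $\delta \in (0, 1)$ and $m \in \mathbb{N}$,

$$P^{2m}\Bigl\{\mathbf{xy} : \lfloor \log \mathcal{N}(\epsilon/4, \mathcal{F}, \mathbf{x}) \rfloor = U \text{ and } 2\,\mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x})\, 2^{\alpha(U, \delta) U \log(17m) \log(5em/U)} < \mathcal{N}(\epsilon, \pi_\epsilon(\mathcal{F}), \mathbf{xy})\Bigr\} \le \delta.$$

Proof: Part 1. If $B_{\mathbf{x}}$ is an $\epsilon$-cover of $\mathcal{F}|_{\mathbf{x}}$ for the function class $\mathcal{F}$ and $B_{\mathbf{y}}$ is an $\epsilon$-cover for $\mathcal{F}|_{\mathbf{y}}$, we can form an (improper) cover $B$ of $\mathcal{F}|_{\mathbf{xy}}$ by simply choosing a function which agrees with each pair of functions from $B_{\mathbf{x}} \times B_{\mathbf{y}}$ on their respective domains. If the sequence $\mathbf{y}$ contains common points with $\mathbf{x}$, delete all such common points from $\mathbf{y}$. The size of the cover required for $\mathbf{y}$ will decrease as a result. Hence, $|B| \le |B_{\mathbf{x}}||B_{\mathbf{y}}|$. It follows that

$$\mathcal{N}(\epsilon, \mathcal{F}, \mathbf{xy}) \le \mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x})\,\mathcal{N}(\epsilon, \mathcal{F}, \mathbf{y}). \qquad (1)$$

Next observe that the fat-shattering dimension $\mathrm{fat}_{\mathcal{F}|_{\mathbf{x}}}(\epsilon)$ satisfies

$$\mathrm{fat}_{\mathcal{F}|_{\mathbf{x}}}(\epsilon) \le \lfloor \log \mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x}) \rfloor, \qquad (2)$$

since any pair of functions realising a distinct dichotomy with margin $\epsilon$ must differ by more than $2\epsilon$ at some point in $\mathbf{x}$ and hence cannot be covered by the same function in any cover.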

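Part 2. For any $\epsilon > 0$ let

$$A^{\epsilon}_{\mathbf{xy}} := \{\mathbf{xy} : \alpha(\mathrm{fat}_{\mathcal{F}|_{\mathbf{x}}}(\epsilon), \delta)\,\mathrm{fat}_{\mathcal{F}|_{\mathbf{x}}}(\epsilon) < \mathrm{fat}_{\mathcal{F}|_{\mathbf{xy}}}(\epsilon)\}.$$

Following an argument similar to one in [7] we will show that for any $\epsilon > 0$, $P^{2m}(A^{\epsilon}_{\mathbf{xy}}) \le \delta$. Let $d = \mathrm{fat}_{\mathcal{F}|_{\mathbf{xy}}}(\epsilon)$ and suppose $\mathbf{z} = (z_1, \dots, z_d) \subseteq \mathbf{xy}$ are $\epsilon$-shattered by $\mathcal{F}$. We use the usual permutation argument. Let

$$E_k := \{\mathbf{xy} : k = \mathrm{fat}_{\mathcal{F}|_{\mathbf{x}}}(\epsilon),\ \alpha(k, \delta)k < d\}$$

and observe that $A^{\epsilon}_{\mathbf{xy}} = \bigcup_k E_k$. Since if $|\mathbf{z} \cap \mathbf{x}| = k$, $\mathrm{fat}_{\mathcal{F}|_{\mathbf{x}}}(\epsilon) \ge k$, we have

$$E_k \subseteq G_k := \{\mathbf{xy} : |\mathbf{z} \cap \mathbf{x}| = k,\ \alpha(k, \delta)k < d\}$$

and by the union bound,

$$P^{2m}(A^{\epsilon}_{\mathbf{xy}}) \le \sum_k P^{2m}(G_k) = \sum_{k : \alpha(k, \delta)k < d} P^{2m}\{\mathbf{xy} : |\mathbf{z} \cap \mathbf{x}| = k\}.$$

But $\alpha(k, \delta)k < d \Rightarrow 3k < d$ for all $\delta \in (0, 1)$. Thus by setting $U$ to satisfy $\alpha(U, \delta)U = d = \alpha U$ we can write

$$P^{2m}(A^{\epsilon}_{\mathbf{xy}}) \le \sum_{k=0}^{U} \binom{d}{k} 2^{-d} \le 2^{-d}\left(\frac{ed}{U}\right)^{U} = 2^{-\alpha U}(e\alpha)^{U},$$

where we have used Lemma 1. One can readily check that for all $\delta \in (0, 1)$ and all $U \in \mathbb{N}$, $2^{-\alpha U}(e\alpha)^{U} \le \delta$. Thus $P^{2m}(A^{\epsilon}_{\mathbf{xy}}) \le \delta$, for all $\epsilon > 0$.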
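
The claim that "one can readily check" can also be confirmed numerically. The sketch below assumes the reading $\alpha(U, \delta) = 3.08(1 + \frac{1}{U}\ln\frac{1}{\delta})$ of Definition 5; the function names are hypothetical and the quantity is evaluated in log space to avoid overflow.

```python
import math


def alpha(U, delta):
    """alpha(U, delta) = 3.08 * (1 + ln(1/delta) / U), as read from Definition 5."""
    return 3.08 * (1.0 + math.log(1.0 / delta) / U)


def tail_bound(U, delta):
    """2^(-alpha U) * (e alpha)^U, computed via its natural logarithm."""
    a = alpha(U, delta)
    log_value = U * (1.0 + math.log(a) - a * math.log(2.0))
    return math.exp(log_value)


# Check the inequality 2^(-alpha U) (e alpha)^U <= delta on a small grid.
holds_on_grid = all(
    tail_bound(U, delta) <= delta
    for U in range(1, 51)
    for delta in (0.5, 0.1, 0.01, 1e-6)
)
```

Since $\alpha(U, \delta) \ge 3.08 > 3$ for every $\delta \in (0,1)$, the implication $\alpha(k,\delta)k < d \Rightarrow 3k < d$ used above is immediate.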

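Part 3. Let $B^{\epsilon}_{\mathbf{xy}}$ be the event in the statement of the proposition. We will show that $B^{\epsilon}_{\mathbf{xy}} \subseteq A^{\epsilon/4}_{\mathbf{xy}}$ and thus $P^{2m}(B^{\epsilon}_{\mathbf{xy}}) \le P^{2m}(A^{\epsilon/4}_{\mathbf{xy}}) \le \delta$. We do this by showing that "$B^{\epsilon}_{\mathbf{xy}}$ is true" $\Rightarrow$ "$A^{\epsilon/4}_{\mathbf{xy}}$ is true". Writing $d_{\mathbf{x}} = \mathrm{fat}_{\mathcal{F}|_{\mathbf{x}}}(\epsilon/4)$ and $d_{\mathbf{xy}} = \mathrm{fat}_{\mathcal{F}|_{\mathbf{xy}}}(\epsilon/4)$ for brevity, we have

$$2\,\mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x})\, 2^{\alpha(U, \delta) U \log(17m) \log(5em/U)} < \mathcal{N}(\epsilon, \pi_\epsilon(\mathcal{F}), \mathbf{xy})$$

$$\Rightarrow\ 2\,\mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x})\, 2^{\alpha(U, \delta) U \log(17m) \log(5em/U)} < \mathcal{N}(\epsilon, \pi_\epsilon(\mathcal{F}), \mathbf{x})\,\mathcal{N}(\epsilon, \pi_\epsilon(\mathcal{F}), \mathbf{y}) \qquad (3)$$

$$\Rightarrow\ 2\,\mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x})\, 2^{\alpha(U, \delta) U \log(17m) \log(5em/U)} < \mathcal{N}(\epsilon, \pi_\epsilon(\mathcal{F}), \mathbf{x})\, 2\left(\frac{4m(2.01\epsilon)^2}{\epsilon^2}\right)^{d_{\mathbf{xy}} \log(2em\,2.01\epsilon/(d_{\mathbf{xy}}\epsilon))} \qquad (4)$$

$$\Rightarrow\ 2\,\mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x})\, 2^{\alpha(d_{\mathbf{x}}, \delta)\, d_{\mathbf{x}} \log(17m) \log(5em/d_{\mathbf{x}})} < \mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x})\, 2\,(17m)^{d_{\mathbf{xy}} \log(5em/d_{\mathbf{xy}})} \qquad (5)$$

$$\Rightarrow\ \alpha(d_{\mathbf{x}}, \delta)\, d_{\mathbf{x}} \log(17m) \log(5em/d_{\mathbf{x}}) + 1 < d_{\mathbf{xy}} \log(5em/d_{\mathbf{xy}}) \log(17m) + 1 \qquad (6)$$

$$\Rightarrow\ \alpha(d_{\mathbf{x}}, \delta)\, d_{\mathbf{x}} < d_{\mathbf{xy}} \qquad (7)$$

where (3) follows from (1); (4) follows from the fact that $\mathrm{fat}_{\mathcal{F}|_{\mathbf{y}}}(\epsilon/4) \le \mathrm{fat}_{\mathcal{F}|_{\mathbf{xy}}}(\epsilon/4)$, that the range of functions in $\pi_\epsilon(\mathcal{F})$ is an interval $[a, b]$ with $b - a = 2.01\epsilon$, and Corollary 1; (5) follows from (2) and the fact that $\mathcal{N}(\epsilon, \pi_\epsilon(\mathcal{F}), \mathbf{x}) \le \mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x})$; (6) follows by dividing both sides of the inequality by $\mathcal{N}(\epsilon, \mathcal{F}, \mathbf{x})$ and taking logs; (7) follows from the fact that $\mathrm{fat}_{\mathcal{F}|_{\mathbf{x}}}(\epsilon/4) \le \mathrm{fat}_{\mathcal{F}|_{\mathbf{xy}}}(\epsilon/4)$ and dividing out common terms on both sides. Now (7) defines the event $A^{\epsilon/4}_{\mathbf{xy}}$ as required. ■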



Lemma 4. Suppose 3^ is a sturdy set of functions that map from X to M. Then 
for any distribution P on X , and any ?7 G N and any 0 G K 

P"™{xy: (3f G T, r = max{/(a;,)}, 2y < 0 - r, [log3^(7/4, l^,x)J = U, 

^ |{* : fiVi) > fi*}! > e(m, k, (5)^ | < (5, 

where e{m,k,6) = ^(C/(log ^ log(17m)a(C/, (5/2) + 1) + log|). 

Proof: Using the standard permutation argument (as in 0), we may fix a 
sequence xy and bound the probability under the uniform distribution on swap- 
ping permutations that the permuted sequence satisfies the condition stated. 
Let

7[/ :=min{7': [log >1(774, x)J =U}. 

By Lemma 13 and the sturdiness of T, the minimum is attained by some choice 
of 2^ functions from IL. The probability above is no greater than 

(xy : 37 G M+, [log?^(774, If, x)J = C/,3/ G ^,^727^,)} , 

where ^/(7) is the event that f{yi) > maxj{f{xj)} + 7 for at least me{m, k, i5) 
points Ui in y. Note that r + 27 < 9. Consider a minimal ju-cover i?xy of 7Tj^(J) 
in the pseudo- metric dxy In the remainder we will suppress the -ji/ subscript to 
the function tt to simplify the notation. We have that for any f G 3^, there exists 
/ G Bxy, with |7r(/)(a;) — 7r(/)(a;)| < 7(7 for all x G xy. Thus since for all x G x, 
by the definition of r, f{x) < r < 6 — 2y, Tr{f){x) < max{0 — 2y,0 — 2.0057[/}, 
and so n{f){x) < 9 — 7;/. However there are at least me{m,k,6) points y G y 
such that f{y) >9> r + 2^, so n{f){y) > r 3- 27 — 7(7 > ma,Xj{TT{f){xj)}. Since 
7T only reduces separation between output values, we conclude that the event 
H^(0) occurs. By the permutation argument, for fixed / at most of 

the sequences obtained by swapping corresponding points satisfy the conditions, 
since the em points with the largest / values must remain on the right hand side 
for H^(0) to occur. Thus by the union bound 

{xy : 37 G M+, [logN(774,T,x)J = C/, 3/ G ^,Af{2yu)} 

where the expectation is over xy drawn according to Hence, by Proposi- 

tion [H with probability at least 1 — <5/2 

E{\Bxy\) < 2 N( 7 [/,T,x) 2 “('^’‘^/ 2 )C/log( 17 m)log( 5 em/( 7 ) 

^ 2'i--\-U[log{17m) log{5em/U)a{U,6/2)-\-l] 

and so P(|Hxy < S/2 provided 

e(m, k,S)> A (jj (1 + log(5em/C/) log(17m)<2(C/, <5/2)) -b log |) , 
as required. ■ 

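We define the mapping $\hat{\ }: \mathbb{R}^{X} \to \mathbb{R}^{X \times \{0,1\}}$ by

$$\hat{f}(x, c) = f(x)(1 - c) + (2\theta - f(x))c,$$

for some fixed real $\theta$. For a set of functions $\mathcal{F}$, we define $\hat{\mathcal{F}} = \hat{\mathcal{F}}_\theta = \{\hat{f} : f \in \mathcal{F}\}$. The idea behind this mapping is that for a function $f$, the corresponding $\hat{f}$ maps the input $x$ and its classification $c$ to an output value, which will be less than $\theta$ provided the classification obtained by thresholding $f(x)$ at $\theta$ is correct.

Let $T_\theta$ denote the threshold function at $\theta$: $T_\theta : \mathbb{R} \to \{0, 1\}$, $T_\theta(\alpha) = 1$ iff $\alpha > \theta$. For a class of functions $\mathcal{F}$, $T_\theta(\mathcal{F}) = \{T_\theta(f) : f \in \mathcal{F}\}$.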
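
The effect of the mapping is easiest to see on a few values. A minimal sketch, with hypothetical function names, checking that $\hat{f}(x, c) < \theta$ exactly when thresholding $f(x)$ at $\theta$ reproduces the label $c$ (outputs lying exactly on the threshold are avoided here, since they are a boundary case):

```python
def f_hat(f_x, c, theta):
    """hat f(x, c) = f(x)(1 - c) + (2*theta - f(x)) c, with f_x = f(x), c in {0, 1}."""
    return f_x * (1 - c) + (2 * theta - f_x) * c


def threshold(a, theta):
    """T_theta(a) = 1 iff a > theta."""
    return 1 if a > theta else 0


theta = 0.5
outputs = [-1.0, 0.2, 0.9, 3.0]  # hypothetical values of f(x), none equal to theta

# For each output and each label, compare "hat f below theta" with "label correct".
agreement = [
    (f_hat(f_x, c, theta) < theta) == (threshold(f_x, theta) == c)
    for f_x in outputs
    for c in (0, 1)
]
```

For $c = 0$ the map leaves $f(x)$ unchanged, and for $c = 1$ it reflects $f(x)$ about $\theta$, so a two-sided margin condition on $f$ becomes a one-sided condition on $\hat{f}$.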



John Shawe-Taylor and Robert C. Williamson



Theorem 1. Consider a sturdy real valued function class $\mathcal{F}$. Fix $\theta \in \mathbb{R}$. If a learner correctly classifies $m$ independently generated examples $\mathbf{z}$ with $h = T_\theta(f) \in T_\theta(\mathcal{F})$ such that $\mathrm{Er}_{\mathbf{z}}(h) = 0$, then for all $\gamma$ such that $\gamma < \min_i |f(x_i) - \theta|$, with confidence $1 - \delta$ the expected error of $h$ is bounded from above by

e(m, U,S) = ^ (^U + a{U, 6/2) log log(17m)^ + log ^ , 

where $U = \lfloor \log \mathcal{N}(\gamma/8, \mathcal{F}, \mathbf{x}) \rfloor$.

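Proof: Making use of Lemma 3 we will move to the double sample and stratify by $U$. By the union bound, it thus suffices to show that $\sum_{U} P^{2m}(J_U) \le \delta/2$, where

$$J_U = \{\mathbf{xy} : \exists h = T_\theta(f) \in T_\theta(\mathcal{F}),\ \mathrm{Er}_{\mathbf{x}}(h) = 0,\ U = \lfloor \log \mathcal{N}(\gamma/8, \mathcal{F}, \mathbf{x}) \rfloor,\ \gamma < \min_i |f(x_i) - \theta|,\ \mathrm{Er}_{\mathbf{y}}(h) > m\,\epsilon(m, U, \delta)/2\}.$$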



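(The largest value of $U$ we need consider is $2m$, since for larger values the bound will in any case be trivial.) It is sufficient if $P^{2m}(J_U) \le \delta/(4m) = \delta'$. Consider $\hat{\mathcal{F}} = \hat{\mathcal{F}}_\theta$. The probability distribution on $\hat{X} = X \times \{0, 1\}$ is given by $P$ on $X$ with the second component determined by the target value of the first component. Note that for a point $\hat{y} \in \hat{\mathbf{y}}$ to be misclassified, it must have $\hat{f}(\hat{y}) \ge \theta > \max\{\hat{f}(\hat{x}) : \hat{x} \in \hat{\mathbf{x}}\} + \gamma$, so that

$$J_U \subseteq \Bigl\{\hat{\mathbf{x}}\hat{\mathbf{y}} \in (X \times \{0, 1\})^{2m} : \exists \hat{f} \in \hat{\mathcal{F}},\ r = \max\{\hat{f}(\hat{x}) : \hat{x} \in \hat{\mathbf{x}}\},\ \gamma < \theta - r,\ U = \lfloor \log \mathcal{N}(\gamma/8, \hat{\mathcal{F}}, \hat{\mathbf{x}}) \rfloor,\ |\{\hat{y} \in \hat{\mathbf{y}} : \hat{f}(\hat{y}) \ge \theta\}| > m\,\epsilon(m, U, \delta)/2\Bigr\}.$$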



Replacing 7 by 7/2 in LemmaEland appealing to LemmaOlwe obtain J[/) < 
6' for 



e(m, U,6) = — (U{1 + a(U,6/2) log(5em/C/) log(17m)) + log(4/i5')) . 
m 

The condition of Lemma 3 is satisfied by this linking of $\epsilon$ and $m$. Substituting for $\delta'$ gives the result. ■

Despite superficial appearances Theorem 1 is quite different from results obtained in [7]. For example, the bound involving the margin of a classifier given there relies on an a priori bound on the fat-shattering dimension for the whole class, not the fat-shattering dimension (or in our case the logarithm of the covering numbers) of the class restricted to the training set. The other result of [7] which is reminiscent of Theorem 1 involves bounding the generalization error in terms of the VC dimension of the set of hypotheses restricted to the training set. This result cannot take into account the margin of a large margin classifier, but refers to classical generalization bounds in terms of the VC dimension. The motivation for obtaining Theorem 1 is recent work computing the covering numbers for Support Vector Machines in terms of the eigenvalues of the kernel. These results will be described in the next section. They show that in many cases the bounds may be significantly smaller than could be obtained by a priori knowledge of the fat-shattering dimension. Thus even though the log covering numbers and fat-shattering dimension can only differ by $\log(m)$ factors, the bounds one has on the quantities can differ significantly; that is certainly the current situation with regard to support vector machines.



4 Generalization from Covering Numbers 

This section will sketch how results can be obtained which combine Theorem 1 with recently introduced bounds on covering numbers. We will give one example of the type of bound that can be derived. We stress this is just one way to use the results; we will present others in a fuller version of this paper.

We first quote some results from the paper cni that has developed the techniques 
for directly bounding the covering numbers of Support Vector Machines. In order 
to avoid introducing a large number of definitions we will summarise some of 
the results in the following theorem which combines Corollary 5 and Theorem 
7 (assertion (12)) of P!3|. The derivation of this result from the general theory 
derived in m requires restricting consideration of the input space to the training 
set. In this case the kernel function $k$ of the Support Vector Machine is completely defined by the inner product matrix $G_{ij} = k(x_i, x_j)$. We denote by $\lambda_s$, $s = 1, \dots, m$, the eigenvalues of this matrix in decreasing order. The parameter $C_k$ of the kernel $k$ is the largest entry in the matrix of eigenvectors. For a vector $z$,
let = 

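Theorem 2. [13] Let $\mathbf{x} \in X^m$ be an $m$-sample for a Support Vector Machine
with kernel $k$, with $\lambda_s$ and $C_k$ defined as above. Let the maximal margin of the
classifier be $\gamma$. Then the scale $\epsilon_N$ that can be achieved for a $d_{\mathbf{x}}$ cover of size $N$
of the set of linear functions implemented by the SVM satisfies

$$\epsilon_N \le \inf_{(a_s)_s :\, a_s \neq 0} \frac{6 C_k}{\gamma} \, \bigl\| (\sqrt{\lambda_s} / a_s)_s \bigr\|_{\ell_2} \, \max_{j \in \mathbb{N}} N^{-1/j} (a_1 a_2 \cdots a_j)^{1/j}.$$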

The above theorem can be used as follows. We have not attempted to provide a
closed form formula for $\mathcal{N}(\gamma/8, \mathcal{F}, \mathbf{x})$ because the evaluation of the above bound
depends very much on how the $\lambda_i$ decay. Specifically the $j$ for which the
maximum occurs is difficult to determine precisely a priori. Since the whole point
of using the result is to use empirical values, which will necessitate a numerical
computation in any case, we do not see this as a disadvantage.

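Pick $a_s = \sqrt{\lambda_s}$, so that, when all $m$ eigenvalues are positive,
$\| (\sqrt{\lambda_s} / a_s)_s \|_{\ell_2} = \sqrt{m}$. For any $N \in \mathbb{N}$ let

$$j^* = j^*(N) = \arg\max_j \; N^{-1/j} \Bigl( \prod_{s=1}^{j} \sqrt{\lambda_s} \Bigr)^{1/j}.$$

Thus

$$\epsilon_N \le \frac{6 C_k \sqrt{m}}{\gamma} \, N^{-1/j^*} \Bigl( \prod_{s=1}^{j^*} \sqrt{\lambda_s} \Bigr)^{1/j^*}.$$

Setting $\gamma/8 = \epsilon = \epsilon_N$ and solving for $N$, we obtain

$$\mathcal{N}(\gamma_N/8, \mathcal{F}, \mathbf{x}) \le N = \Bigl( \frac{48 \, C_k \sqrt{m}}{\gamma_N^2} \Bigr)^{j^*} \prod_{s=1}^{j^*} \sqrt{\lambda_s},$$

giving an effective VC dimension of $j^*$. One can readily make use of this
numerically since $\gamma_N$ is monotonically decreasing in $N$ and hence a binary search can
be used to find $N$ (and the relevant value of $j^*$) as a function of a given $\gamma$. The
value of $\mathcal{N}(\gamma/8, \mathcal{F}, \mathbf{x})$ can then be used in Theorem 1.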

In practical examples the eigenvalues of the inner product matrix frequently 
decay very fast. The above argument shows that such a decay can be translated 
into an “effective” VC dimension relevant to the achieved margin of the classifier. 
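
The binary search mentioned above can be sketched numerically. The following Python fragment is only an illustration under stated assumptions, not the authors' implementation: it assumes the scale bound has the form $\epsilon_N \le \frac{6 C_k \sqrt{m}}{\gamma} \max_j N^{-1/j} (\prod_{s \le j} \sqrt{\lambda_s})^{1/j}$ (the choice $a_s = \sqrt{\lambda_s}$ with all eigenvalues positive), and the names `log_scale_bound` and `smallest_cover_size` are hypothetical. It works with logarithms to avoid overflow and returns the smallest $N$ for which the assumed bound drops to $\gamma/8$, together with the maximising index $j^*$.

```python
import math


def log_scale_bound(log_n, eigvals, c_k, gamma):
    """Return (log of the assumed bound on eps_N, maximising index j*).

    eigvals must be sorted in decreasing order; log_n is ln(N).
    """
    m = len(eigvals)
    log_prod = 0.0                      # ln of prod_{s<=j} sqrt(lambda_s)
    best, best_j = -math.inf, 0
    for j, lam in enumerate(eigvals, start=1):
        if lam <= 0.0:                  # later products vanish, so they cannot be the max
            break
        log_prod += 0.5 * math.log(lam)
        value = (log_prod - log_n) / j  # ln of N^(-1/j) * (prod sqrt(lambda_s))^(1/j)
        if value > best:
            best, best_j = value, j
    return math.log(6.0 * c_k * math.sqrt(m) / gamma) + best, best_j


def smallest_cover_size(eigvals, c_k, gamma, max_doublings=2000):
    """Smallest N whose assumed scale bound is at most gamma / 8, and its j*."""
    target = math.log(gamma / 8.0)

    def ok(n):
        return log_scale_bound(math.log(n), eigvals, c_k, gamma)[0] <= target

    hi = 1
    for _ in range(max_doublings):      # find an N at which the bound holds
        if ok(hi):
            break
        hi *= 2
    else:
        raise ValueError("bound not reached within the search range")
    lo = hi // 2                        # bound fails at lo (lo == 0 only if hi == 1)
    while hi - lo > 1:                  # the bound is monotone in N, so bisect
        mid = (lo + hi) // 2
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi, log_scale_bound(math.log(hi), eigvals, c_k, gamma)[1]


# Illustrative eigenvalues only; real use would take them from the Gram matrix.
n_star, j_star = smallest_cover_size([1.0, 0.5, 0.25], c_k=1.0, gamma=1.0)
```

Because the bound is a maximum of functions that each decrease in $N$, the predicate `ok` switches from false to true exactly once, which is what makes bisection valid here.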



5 Conclusions 

This paper has presented a method by which recently achieved bounds [?] on
the covering numbers of a function class on a training set can be used to bound 
the generalization error of the resulting classifier. In the previous section we have 
shown how this method can then be used to derive alternative bounds on the 
generalization error derived from observed properties of the margin and inner 
product matrix of a Support Vector Machine. 

Improved bounds can be used to guide more refined Structural Risk Minimization 
over choices of different kernels for example. Hence, the approach developed here 
may well have applications in practical learning systems. Our hope is that these 
methods may also be able to give bounds that are more realistic than previous
PAC estimates.
Abstract. We derive new bounds for the generalization error of feature
space machines, such as support vector machines and related regularization
networks, by obtaining new bounds on their covering numbers.
The proofs are based on a viewpoint that is apparently novel in the field
of statistical learning theory. The hypothesis class is described in terms
of a linear operator mapping from a possibly infinite dimensional unit
ball in feature space into a finite dimensional space. The covering numbers
of the class are then determined via the entropy numbers of the
operator. These numbers, which characterize the degree of compactness
of the operator, can be bounded in terms of the eigenvalues of an integral
operator induced by the kernel function used by the machine. As a
consequence we are able to theoretically explain the effect of the choice
of kernel functions on the generalization performance of support vector
machines.



1 Introduction, Definitions and Notation 

In this paper we give new bounds on the covering numbers for feature space
machines. This leads to improved bounds on their generalization performance.
Feature space machines perform a mapping from input space into a feature
space, construct regression functions or decision boundaries based on this
mapping, and use constraints in feature space for capacity control. Support Vector
(SV) machines, which have recently been proposed as a new class of learning
algorithms solving problems of pattern recognition, regression estimation, and
operator inversion [?], are a well known example of this class.

A key feature of the present paper is the manner in which we directly bound the
covering numbers of interest rather than making use of a combinatorial dimension
(such as the VC-dimension or the fat-shattering dimension) and subsequent
application of a general result relating such dimensions to covering numbers. We
bound covering numbers by viewing the relevant class of functions as the image
of a unit ball under a particular compact operator. The results can be applied to
bound the generalization performance of SV regression machines, although we
do not explicitly indicate the results so obtained in this brief paper.

* Supported by the Australian Research Council and the DFG (Ja 379/71).

Capacity control. In order to perform pattern recognition using linear
hyperplanes, often a maximum margin of separation between the classes is sought,
as this leads to good generalization ability independent of the dimensionality
[?]. It can be shown that for separable training data
$(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^d \times \{\pm 1\}$, this is achieved by minimizing $\|w\|_2$ subject to the
constraints $y_j(\langle w, x_j \rangle + b) \ge 1$ for $j = 1, \ldots, m$ and some $b \in \mathbb{R}$. The decision
function then takes the form $f(x) = \operatorname{sgn}(\langle w, x \rangle + b)$. Similarly, a linear regression
$f(x) = \langle w, x \rangle + b$ can be estimated from data $(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^d \times \mathbb{R}$ by
finding the flattest function which approximates the data within some margin
of error: in this case, one minimizes $\|w\|_2$ subject to $|f(x_j) - y_j| \le \epsilon$, where
the parameter $\epsilon \ge 0$ plays the role of the margin, albeit not in the space of the
inputs $x$, but in that of the outputs $y$.

Nonlinear kernels. In order to apply the above reasoning to a rather general
class of nonlinear functions, one can use kernels computing dot products in
high-dimensional spaces nonlinearly related to input space [?]. Under certain
conditions on a kernel $k$, to be stated below (Theorem 1), there exists a nonlinear
map $\Phi$ into a reproducing kernel Hilbert space $F$ such that $k$ computes the dot
product in $F$, i.e. $k(x, y) = \langle \Phi(x), \Phi(y) \rangle_F$. Given any algorithm which can be
expressed in terms of dot products exclusively, one can thus construct a nonlinear
version of it by substituting a kernel for the dot product.

By using the kernel trick for SV machines, the maximum margin idea is thus
extended to a large variety of nonlinear function classes (e.g. radial basis
function networks, polynomial networks, neural networks), which in the case of
regression estimation comprise functions written as kernel expansions
$f(x) = \sum_{j=1}^{m} \alpha_j k(x_j, x) + b$, with $\alpha_j \in \mathbb{R}$, $j = 1, \ldots, m$. It has been noticed that
different kernels can be characterized by their regularization properties [?]. This
provides insight into the regularization properties of SV kernels. However, it
does not give us a comprehensive understanding of how to select a kernel for
a given learning problem, and how using a specific kernel might influence the
performance of an SV machine.
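
To make the kernel expansion concrete, here is a minimal Python sketch that evaluates $f(x) = \sum_j \alpha_j k(x_j, x) + b$. The Gaussian kernel, its `width` parameter, and the function names are illustrative assumptions rather than anything fixed by the paper; any kernel with the same call signature could be substituted.

```python
import math


def gaussian_kernel(x, y, width=1.0):
    """k(x, y) = exp(-||x - y||^2 / width^2); width is an illustrative parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / width ** 2)


def kernel_expansion(x, points, alphas, b, kernel=gaussian_kernel):
    """Evaluate f(x) = sum_j alpha_j * k(x_j, x) + b."""
    return sum(a * kernel(xj, x) for a, xj in zip(alphas, points)) + b


# Two expansion points on the real line with hand-picked coefficients.
points = [(0.0,), (1.0,)]
alphas = [1.0, -1.0]
value_at_origin = kernel_expansion((0.0,), points, alphas, b=0.5)
```

Swapping `gaussian_kernel` for another kernel changes the function class without touching the expansion itself, which is the point of the kernel trick.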

Definitions and Notation. For $d \in \mathbb{N}$, $\mathbb{R}^d$ denotes the $d$-dimensional space
of vectors $x = (x_1, \ldots, x_d)$. We define spaces $\ell_p^d$ as follows: as vector spaces,
they are identical to $\mathbb{R}^d$; in addition, they are endowed with $p$-norms: for
$0 < p < \infty$, $\|x\|_{\ell_p^d} := \|x\|_p = \bigl( \sum_{j=1}^{d} |x_j|^p \bigr)^{1/p}$; for $p = \infty$,
$\|x\|_{\ell_\infty^d} := \|x\|_\infty = \max_{j=1,\ldots,d} |x_j|$. Analogously $\ell_p$ is the space of infinite
sequences with the obvious definition of the norm. Given $m$ points
$x_1, \ldots, x_m \in \ell_p^d$, we use the shorthand $X^m = (x_1, \ldots, x_m)$. Suppose $\mathcal{F}$ is a class of
functions defined on $\mathbb{R}^d$. The $\ell_\infty^{X^m}$ norm with respect to $X^m$ of $f \in \mathcal{F}$ is defined as
$\|f\|_{\ell_\infty^{X^m}} := \max_{i=1,\ldots,m} |f(x_i)|$.
Given some set $X$, a measure $\mu$ on $X$, some $1 \le p < \infty$ and a function $f : X \to \mathbb{K}$,
we define $\|f\|_{L_p(X)} := \bigl( \int_X |f(x)|^p \, d\mu(x) \bigr)^{1/p}$ if the integral exists, and
$\|f\|_{L_\infty(X)} := \operatorname{ess\,sup}_{x \in X} |f(x)|$. For $1 \le p \le \infty$, we let
$L_p(X) := \{ f : X \to \mathbb{K} : \|f\|_{L_p(X)} < \infty \}$.

Let $\mathfrak{L}(E, F)$ be the set of all bounded linear operators $T$ between the normed
spaces $(E, \|\cdot\|_E)$ and $(F, \|\cdot\|_F)$, i.e. operators such that the image of the (closed)
unit ball $U_E := \{x \in E : \|x\|_E \le 1\}$ is bounded. The smallest such bound is called
the operator norm, $\|T\| := \sup_{x \in U_E} \|Tx\|_F$. The $n$th entropy number of a set
$M \subset E$, for $n \in \mathbb{N}$, is

$$\epsilon_n(M) := \inf\{\epsilon > 0 : \text{there exists an } \epsilon\text{-cover for } M \text{ in } E \text{ containing } n \text{ or fewer points}\}.$$

The entropy numbers of an operator $T \in \mathfrak{L}(E, F)$ are defined as
$\epsilon_n(T) := \epsilon_n(T(U_E))$. Note that $\epsilon_1(T) = \|T\|$, and that $\epsilon_n(T)$ certainly is well defined
for all $n \in \mathbb{N}$ if $T$ is a compact operator, i.e. if $T(U_E)$ is compact. The dyadic
entropy numbers of an operator are defined by $e_n(T) := \epsilon_{2^{n-1}}(T)$, $n \in \mathbb{N}$. A very
nice introduction to entropy numbers of operators is [5]. The $\epsilon$-covering number
of $\mathcal{F}$ with respect to the metric $d$, denoted $\mathcal{N}(\epsilon, \mathcal{F}, d)$, is the size of the smallest
$\epsilon$-cover for $\mathcal{F}$ using the metric $d$. By log and ln, we denote the logarithms to base
2 and $e$, respectively. By $i$, we denote the imaginary unit $i = \sqrt{-1}$; $k$ will always
be a kernel, and $d$ and $m$ will be the input dimensionality and the number of
examples $(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^d \times \mathbb{R}$, respectively. We will map the input
data into a feature space via a mapping $\Phi$. We let $\tilde{x} := \Phi(x)$.
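
As a small illustration of the covering-number definition, the following Python sketch builds an $\epsilon$-cover of a finite point set greedily in the $\ell_\infty$ metric. It is an assumption-laden toy, not a method from the paper: the function names are hypothetical, the centres are restricted to the given points, and the size of the returned cover is only an upper bound on the covering number of that set.

```python
def linf_dist(u, v):
    """Distance in the l_infinity metric between two equal-length vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


def greedy_cover(points, eps, dist=linf_dist):
    """Pick centres so that every point lies within eps of some centre.

    A point becomes a new centre only if no existing centre is within eps,
    so the centres are pairwise more than eps apart (an eps-packing) and
    together form an eps-cover of the input points.
    """
    centres = []
    for p in points:
        if all(dist(p, c) > eps for c in centres):
            centres.append(p)
    return centres


# Eleven integer points on a line, covered at scale eps = 1.
pts = [(i,) for i in range(11)]
cover = greedy_cover(pts, eps=1)
```

The fact that the greedy centres are simultaneously a packing and a cover mirrors the equivalence, up to constant factors in the scale, between packing and covering numbers.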

2 Operator Theory Methods for Entropy Numbers 

In this section we briefly explain the new viewpoint implicit in the present
paper. With reference to Figure 1, consider the traditional viewpoint in statistical
learning theory. One is given a class of functions $\mathcal{F}$, and the generalization
performance attainable using $\mathcal{F}$ is determined via the covering numbers of $\mathcal{F}$. More
precisely, for some set $X$, and $x_i \in X$ for $i = 1, \ldots, m$, define the $\epsilon$-growth
function of the function class $\mathcal{F}$ on $X$ as

$$\mathcal{N}^m(\epsilon, \mathcal{F}) := \sup_{x_1, \ldots, x_m \in X} \mathcal{N}(\epsilon, \mathcal{F}, \ell_\infty^{X^m}), \qquad (1)$$

where $\mathcal{N}(\epsilon, \mathcal{F}, \ell_\infty^{X^m})$ is the $\epsilon$-covering number of $\mathcal{F}$ with respect to $\ell_\infty^{X^m}$. Many
generalization error bounds can be expressed in terms of $\mathcal{N}^m(\epsilon, \mathcal{F})$. An example
is given in the following section.

The key novelty in the present work solely concerns the manner in which the
covering numbers are computed. Traditionally, appeal has been made to a
result such as the so-called Sauer’s lemma (originally due to Vapnik and
Chervonenkis). In the case of function learning, a generalization due to Pollard (called
the pseudo-dimension), or Vapnik and Chervonenkis (called the VC-dimension
of real valued functions), or a scale-sensitive generalization of that (called the
fat-shattering dimension) is used to bound the covering numbers. These results
reduce the computation of $\mathcal{N}^m(\epsilon, \mathcal{F})$ to the computation of a single
“dimension-like” quantity. An overview of these various dimensions, some details of their
history, and some examples of their computation can be found in [5].
In the present work, we view the class $\mathcal{F}$ as being induced by an operator $T_k$
depending on some kernel function $k$. Thus $\mathcal{F}$ is the image of a “base class” $S$
under $T_k$. The analogy implicit in the picture is that the quantity that matters
is the number of $\epsilon$-distinguishable messages obtainable at the information sink.
(Recall the equivalence up to a constant factor of packing and covering numbers.)
In a typical communications problem, one tries to maximize the number of
distinguishable messages (per unit time), in order to maximize the information
transmission rate. But from the point of view of the receiver, the job is made
easier the smaller the number of distinct messages that one needs to be concerned
with decoding. The significance of the picture is that the kernel in question is
exactly the kernel that is used, for example, in support vector machines. As a
consequence, the determination of $\mathcal{N}^m(\epsilon, \mathcal{F})$ can be done in terms of properties
of the operator $T_k$. The latter thus plays a constructive role in controlling the
complexity of $\mathcal{F}$ and hence the difficulty of the learning task. We believe that the
new viewpoint in itself is potentially very valuable, perhaps more so than the
specific results in the paper. A further exploitation of the new viewpoint can be
found in [?]. There are in fact a variety of ways to define exactly what is meant
by $T_k$, and we have deliberately not been explicit in the picture. We make use
of one particular definition in this paper. A slightly different approach is taken in [?].

We conclude this section with some brief historical remarks. 

The concept of the metric entropy of a set has been around for some time. It
seems to have been introduced by Pontriagin and Schnirelmann [?] and was
studied in detail by Kolmogorov and others [?]. The use of metric entropy to
say something about linear operators was developed independently by several
people. Prosser appears to have been the first to make the idea explicit. He
determined the effect of an operator’s spectrum on its entropy numbers. In
particular, he proved a number of results concerning the asymptotic rate of decrease
of the entropy numbers in terms of the asymptotic behaviour of the eigenvalues.
A similar result is actually implicit in section 22 of Shannon’s famous paper,
where he considered the effect of different convolution operators on the entropy of
an ensemble. Prosser’s paper [?] led to a handful of papers (see e.g. [?])
which studied various convolutional operators. A connection between Prosser’s
$\epsilon$-entropy of an operator and Kolmogorov’s $\epsilon$-entropy of a stochastic process was
shown in [?]. Independently, another group of mathematicians including Carl
and Stephani [?] studied covering numbers [?] and later entropy numbers [?]
in the context of operator ideals. (They seem to be unaware of Prosser’s work;
see e.g. [?, p. 136].)

Connections between the local theory of Banach spaces and uniform convergence
of empirical means have been noted before (e.g. [22]). More recently Gurvits [?]
has obtained a result relating the Rademacher type of a Banach space to the
fat-shattering dimension of linear functionals on that space and hence via the key
result in [?] to the covering numbers of the induced class. We will make further
remarks concerning the relationship between Gurvits’ approach and ours in [?];
for now let us just note that the equivalence of the type of an operator (or of
the space it maps to), and the rate of decay of its entropy numbers has been



Entropy Numbers, Operators and Support Vector Kernels 289 



(independently) shown by Kolchinskii and Defant and Junge [?]. Note
that the exact formulation of their results differs. Kolchinskii was motivated by
probabilistic problems not unlike ours.



3 Generalization Bounds via Uniform Convergence 



The generalization performance of learning machines can be bounded via uniform
convergence results as in [?]. The key thing about these results is the role of
the covering numbers of the hypothesis class, which is the focus of the present paper.
Results for both classification and regression are now known. For the sake of
concreteness, we quote below a result suitable for regression which was proved
in [?]. Let $P_m(f) := \frac{1}{m} \sum_{j=1}^{m} f(x_j)$ denote the empirical mean of $f$ on the sample
$x_1, \ldots, x_m$.

Lemma 1 (Alon, Ben-David, Cesa-Bianchi, and Haussler, 1997). Let
$\mathcal{F}$ be a class of functions from $X$ into $[0, 1]$ and let $P$ be a distribution over $X$.
Then, for all $\epsilon > 0$ and all $m \ge \frac{2}{\epsilon^2}$,

$$\Pr\Bigl\{ \sup_{f \in \mathcal{F}} |P_m(f) - P(f)| > \epsilon \Bigr\} \le 12 m \cdot \mathbf{E}\Bigl[ \mathcal{N}\bigl( \tfrac{\epsilon}{6}, \mathcal{F}, \ell_\infty^{\bar{X}^{2m}} \bigr) \Bigr] e^{-\epsilon^2 m / 36}, \qquad (2)$$

where $\Pr$ denotes the probability w.r.t. the sample $x_1, \ldots, x_m$ drawn i.i.d. from
$P$, and $\mathbf{E}$ the expectation w.r.t. a second sample $\bar{X}^{2m} = (x'_1, \ldots, x'_{2m})$ also drawn
i.i.d. from $P$.



In order to use this lemma one can make use of the fact that
$\mathbf{E}\bigl[ \mathcal{N}(\epsilon, \mathcal{F}, \ell_\infty^{X^m}) \bigr] \le \mathcal{N}^m(\epsilon, \mathcal{F})$. The above result can be used to give a
generalization error result by applying it to the loss-function induced class using
standard techniques. Furthermore, one can obtain bounds on the generalization
error of classifiers, as a function of the margin achieved on a training sample, in
terms of these covering numbers; see [?].
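
To get a feel for the right-hand side of Lemma 1, the short Python sketch below evaluates $12 m \cdot \mathbf{E}[\mathcal{N}] \, e^{-\epsilon^2 m / 36}$ from a supplied value of $\ln \mathbf{E}[\mathcal{N}]$. The function name and the idea of passing the log covering number directly are illustrative assumptions; in practice that quantity would come from a bound such as those developed later in the paper.

```python
import math


def uniform_deviation_bound(m, eps, log_cover):
    """Evaluate 12 * m * E[N] * exp(-eps^2 * m / 36), clipped to 1.

    log_cover is the natural log of (a bound on) the expected covering
    number at scale eps / 6. The lemma only applies when m >= 2 / eps^2.
    """
    if m < 2.0 / eps ** 2:
        raise ValueError("the lemma requires m >= 2 / eps^2")
    log_rhs = math.log(12.0 * m) + log_cover - eps ** 2 * m / 36.0
    if log_rhs >= 0.0:          # a probability bound above 1 is vacuous
        return 1.0
    return math.exp(log_rhs)


# Illustrative numbers only: 10^5 examples, eps = 0.1, covering number 1000.
bound = uniform_deviation_bound(100000, 0.1, math.log(1000.0))
```

The exponent $\epsilon^2 m / 36$ must outgrow $\ln(12 m) + \ln \mathbf{E}[\mathcal{N}]$ before the bound says anything, which is why sharper covering-number estimates translate directly into smaller sample sizes.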



4 Entropy Numbers for Kernel Machines 

In the following we will mainly consider machines where the mapping into feature
space is defined by Mercer kernels $k(x, y)$ as they are easier to deal with using
functional analytic methods. Such machines have become very popular due to
the success of SV machines. Nonetheless in Subsection 4.3 we will show how a
more direct approach could be taken towards upper-bounding entropy numbers.



4.1 Mercer’s Theorem, Feature Spaces and Scaling 

Our goal is to make statements about the shape of the image of the input
space $X$ under the feature map $\Phi(\cdot)$. We will make use of Mercer’s theorem. The
version stated below is a special case of the theorem proven in [?, p. 145]. In the
following we will assume $(X, \mu)$ to be a finite measure space, i.e. $\mu(X) < \infty$. As
usual, by “almost all” we mean for all elements of $X^n$ except a set of $\mu^n$-measure
zero.

Theorem 1 (Mercer). Suppose $k \in L_\infty(X^2)$ such that the integral operator
$T_k : L_2(X) \to L_2(X)$,

$$T_k f(\cdot) := \int_X k(\cdot, y) f(y) \, d\mu(y) \qquad (3)$$

is positive. Let $\psi_j \in L_2(X)$ be the eigenfunction of $T_k$ associated with the
eigenvalue $\lambda_j \neq 0$ and normalized such that $\|\psi_j\|_{L_2} = 1$, and let $\bar{\psi}_j$ denote its complex
conjugate. Then

1. $(\lambda_j(T_k))_j \in \ell_1$.

2. $\psi_j \in L_\infty(X)$ and $\sup_j \|\psi_j\|_{L_\infty} < \infty$.

3. $k(x, y) = \sum_{j \in \mathbb{N}} \lambda_j \psi_j(x) \bar{\psi}_j(y)$ holds for almost all $(x, y)$, where the series
converges absolutely and uniformly for almost all $(x, y)$.

We will call a kernel satisfying the conditions of this theorem a Mercer kernel.
From statement 2 of Mercer’s theorem there exists some constant $C_k \in \mathbb{R}^+$
depending on $k(\cdot, \cdot)$ such that

$$|\psi_j(x)| \le C_k \quad \text{for all } j \in \mathbb{N} \text{ and } x \in X. \qquad (4)$$

(Actually (4) holds only for almost all $x \in X$, but from here on we gloss over these
measure-theoretic niceties in the exposition.) Moreover from statement 3 it
follows that $k(x, y)$ corresponds to a dot product in $\ell_2$, i.e. $k(x, y) = \langle \Phi(x), \Phi(y) \rangle_{\ell_2}$
with

$$\Phi : X \to \ell_2, \qquad x \mapsto (\phi_j(x))_j := \bigl( \sqrt{\lambda_j} \, \psi_j(x) \bigr)_j \qquad (5)$$

for almost all $x \in X$. In the following we will (without loss of generality) assume
the sequence $(\lambda_j)_j$ to be sorted in nonincreasing order. From the argument above
one can see that $\Phi(X)$ lives not only in $\ell_2$ but in an axis parallel parallelepiped
with lengths $2 C_k \sqrt{\lambda_j}$.
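
The correspondence between a Mercer expansion and the feature map can be checked on a toy example. The Python sketch below is an illustration under explicit assumptions, not taken from the paper: it uses the real trigonometric eigenfunctions $1/\sqrt{2\pi}$, $\cos(jx)/\sqrt{\pi}$, $\sin(jx)/\sqrt{\pi}$ on $[0, 2\pi)$ with Lebesgue measure, lets each frequency $j \ge 1$ carry the same eigenvalue for its cosine and sine eigenfunctions, and truncates the expansion to finitely many terms. With these choices $C_k = 1/\sqrt{\pi}$.

```python
import math


def kernel(x, y, eigvals):
    """k(x, y) = sum_j lambda_j psi_j(x) psi_j(y) for the trigonometric system."""
    total = eigvals[0] / (2.0 * math.pi)
    for j, lam in enumerate(eigvals[1:], start=1):
        # cos(jx)cos(jy) + sin(jx)sin(jy) = cos(j(x - y))
        total += lam / math.pi * math.cos(j * (x - y))
    return total


def feature_map(x, eigvals):
    """Phi(x) = (sqrt(lambda_j) psi_j(x))_j, one cos and one sin coordinate per j >= 1."""
    phi = [math.sqrt(eigvals[0] / (2.0 * math.pi))]
    for j, lam in enumerate(eigvals[1:], start=1):
        scale = math.sqrt(lam / math.pi)
        phi.append(scale * math.cos(j * x))
        phi.append(scale * math.sin(j * x))
    return phi


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Illustrative, nonincreasing eigenvalues.
eigvals = [1.0, 0.5, 0.25, 0.125]
lhs = kernel(0.3, 1.7, eigvals)
rhs = dot(feature_map(0.3, eigvals), feature_map(1.7, eigvals))
```

Each coordinate of `feature_map` is bounded in absolute value by $C_k \sqrt{\lambda_j}$, which is the parallelepiped described in the text.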

It will be useful to consider maps that map $\Phi(X)$ into balls of some radius $R$
centered at the origin. The following proposition shows that the class of all these
maps is determined by elements of $\ell_2$ and the sequence of eigenvalues $(\lambda_j)_j$.

Proposition 1 (Mapping $\Phi(X)$ into $\ell_2$). Let $S$ be the diagonal map

$$S : \mathbb{R}^{\mathbb{N}} \to \mathbb{R}^{\mathbb{N}}, \qquad (x_j)_j \mapsto S(x_j)_j = (s_j x_j)_j. \qquad (6)$$

Then $S$ maps $\Phi(X)$ into a ball of finite radius $R_S$ centered at the origin if and
only if $(\sqrt{\lambda_j} \, s_j)_j \in \ell_2$.
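Proof.

($\Leftarrow$) Suppose $(s_j \sqrt{\lambda_j})_j \in \ell_2$ and let $R_S^2 := C_k^2 \| (s_j \sqrt{\lambda_j})_j \|_{\ell_2}^2 < \infty$. For any $x \in X$,

$$\|S \Phi(x)\|_{\ell_2}^2 = \sum_{j \in \mathbb{N}} s_j^2 \lambda_j |\psi_j(x)|^2 \le \sum_{j \in \mathbb{N}} s_j^2 \lambda_j C_k^2 = R_S^2. \qquad (7)$$

Hence $S \Phi(X) \subseteq \ell_2$.

($\Rightarrow$) Suppose $(s_j \sqrt{\lambda_j})_j$ is not in $\ell_2$. Hence the sequence $(A_n)_n$ with
$A_n := \sum_{j=1}^{n} s_j^2 \lambda_j$ is unbounded. Now define

$$a_n(x) := \sum_{j=1}^{n} s_j^2 \lambda_j |\psi_j(x)|^2. \qquad (8)$$

Then $\|a_n(\cdot)\|_{L_1(X)} = A_n$ due to the normalization condition on $\psi_j$. However, as
$\mu(X) < \infty$ there exists a set $\tilde{X}$ of nonzero measure such that

$$a_n(x) \ge \frac{A_n}{\mu(X)} \quad \text{for all } x \in \tilde{X}. \qquad (9)$$

Combining the left side of (7) with (8) we obtain $\|S \Phi(x)\|_{\ell_2}^2 \ge a_n(x)$ for all
$n \in \mathbb{N}$ and almost all $x$. Since $a_n(x)$ is unbounded for a set $\tilde{X}$ with nonzero
measure in $X$, we can see that $S \Phi(X) \not\subset \ell_2$. ∎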



The consequence of this result is that there exists no axis parallel ellipsoid $\mathcal{E}$
not completely containing the (also) axis parallel parallelepiped $\mathcal{B}$ of sidelength
$(2 C_k \sqrt{\lambda_j})_j$, such that $\mathcal{E}$ would contain $\Phi(X)$. More formally

$$\mathcal{B} \subset \mathcal{E} \quad \text{if and only if} \quad \Phi(X) \subset \mathcal{E}.$$

Hence $\Phi(X)$ contains a set of nonzero measure of elements near the corners of
the parallelepiped.

Once we know that $\Phi(X)$ “fills” the parallelepiped described above we can use
this result to construct an inverse mapping $A$ from the unit ball in $\ell_2$ to an
ellipsoid $\mathcal{E}$ such that $\Phi(X) \subset \mathcal{E}$ as in the following diagram.



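$$\begin{array}{ccccc}
X & \xrightarrow{\ \Phi\ } & \Phi(X) & \xrightarrow{\ A^{-1}\ } & U_{\ell_2} \\
 & & \cap & \swarrow {\scriptstyle A} & \\
 & & \mathcal{E} & &
\end{array} \qquad (10)$$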

The operator $A$ will be useful for computing the entropy numbers of
concatenations of operators. (Knowing the inverse will allow us to compute the forward
operator, and that can be used to bound the covering numbers of the class of
functions, as shown in the next subsection.) We thus seek an operator $A : \ell_2 \to \ell_2$
such that

$$A(U_{\ell_2}) \subseteq \mathcal{E}. \qquad (11)$$

We can ensure this by constructing $A$ such that

$$A : (x_j)_j \mapsto (R_A a_j x_j)_j \qquad (12)$$

with $R_A := C_k \| (\sqrt{\lambda_j} / a_j)_j \|_{\ell_2}$. From Proposition 1 it follows that all those
operators $A$ for which $R_A < \infty$ will satisfy (11). We call such scaling (inverse)
operators admissible.



4.2 Entropy Numbers 

The next step is to compute the entropy numbers of the operator $A$ and use this
to obtain bounds on the entropy numbers for kernel machines like SV machines.
We will make use of the following theorem due to Gordon, König and Schütt
[?, p. 226] (stated in the present form in [?, p. 17]).

Theorem 2. Let $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_j \ge \cdots \ge 0$ be a non-increasing sequence of
non-negative numbers and let

$$D x = (\sigma_1 x_1, \sigma_2 x_2, \ldots, \sigma_j x_j, \ldots) \qquad (13)$$

for $x = (x_1, x_2, \ldots, x_j, \ldots) \in \ell_p$ be the diagonal operator from $\ell_p$ into itself,
generated by the sequence $(\sigma_j)_j$, where $1 \le p \le \infty$. Then for all $n \in \mathbb{N}$,

$$\sup_{j \in \mathbb{N}} n^{-1/j} (\sigma_1 \sigma_2 \cdots \sigma_j)^{1/j} \le \epsilon_n(D) \le 6 \sup_{j \in \mathbb{N}} n^{-1/j} (\sigma_1 \sigma_2 \cdots \sigma_j)^{1/j}. \qquad (14)$$

We can exploit the freedom in choosing $A$ to minimize an entropy number as
the following corollary shows. This will be a key ingredient of our calculation of
the covering numbers for SV classes, as shown below.
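
The two-sided bound of Theorem 2 is easy to evaluate for a finitely supported sequence. The Python sketch below is a minimal illustration; the function name is hypothetical and the input is assumed to be a nonincreasing list of non-negative numbers, implicitly padded with zeros (so that the supremum is attained among the listed indices).

```python
import math


def diagonal_entropy_bounds(sigma, n):
    """Return (lower, upper) bounds on eps_n(D) from Theorem 2.

    lower = sup_j n^(-1/j) * (sigma_1 * ... * sigma_j)^(1/j), upper = 6 * lower.
    The products are accumulated in log space to avoid underflow.
    """
    log_prod = 0.0
    best = 0.0
    for j, s in enumerate(sigma, start=1):
        if s <= 0.0:            # all later products are zero
            break
        log_prod += math.log(s)
        best = max(best, math.exp((log_prod - math.log(n)) / j))
    return best, 6.0 * best


# Illustrative geometric sequence of diagonal entries.
lower_1, upper_1 = diagonal_entropy_bounds([1.0, 0.5, 0.25], n=1)
lower_4, upper_4 = diagonal_entropy_bounds([1.0, 0.5, 0.25], n=4)
```

For $n = 1$ the lower bound equals $\sigma_1 = \|D\|$, consistent with $\epsilon_1(T) = \|T\|$; for larger $n$ the maximising $j$ moves to higher indices as the faster-decaying factor $n^{-1/j}$ is traded against the shrinking geometric mean.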



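Corollary 1 (Entropy numbers for $\Phi(X)$). Let $k : X \times X \to \mathbb{R}$ be a Mercer
kernel and let $A$ be defined by (12). Then

$$\epsilon_n(A : \ell_2 \to \ell_2) \le \inf_{(a_s)_s :\, (\sqrt{\lambda_s} / a_s)_s \in \ell_2} \; \sup_{j \in \mathbb{N}} \; 6 C_k \bigl\| (\sqrt{\lambda_s} / a_s)_s \bigr\|_{\ell_2} \, n^{-1/j} (a_1 a_2 \cdots a_j)^{1/j}. \qquad (15)$$

This result follows immediately by identifying $D$ and $A$ and exploiting the
freedom that we still have in choosing a particular operator $A$ among the class of
admissible ones.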

As already described in Section 1, the hypotheses that an SV machine generates
can be expressed as $\langle w, \tilde{x} \rangle + b$ where both $w$ and $\tilde{x}$ are defined in the feature
space $S = \operatorname{span}(\Phi(X))$ and $b \in \mathbb{R}$. The kernel trick as introduced by [?] was then
successfully employed in [?] and [?] to extend the Optimal Margin Hyperplane
classifier to what is now known as the SV machine. (The “$+b$” term is readily
dealt with; we omit such considerations here though.) Consider the class

$$\mathcal{F}_{R_w} := \{ \langle w, \tilde{x} \rangle : \tilde{x} \in S, \ \|w\| \le R_w \} \subseteq \mathbb{R}^{S}.$$

Note that $\mathcal{F}_{R_w}$ depends implicitly on $k$ since $S$ does.
What we seek are the covering numbers for the class $\mathcal{F}_{R_w}$ induced by the
kernel in terms of the parameter $R_w$, which is the inverse of the size of the margin
in feature space, or equivalently, the size of the weight vector in feature space
as defined by the dot product in $S$ (see [?] for details). In the following we
will call such hypothesis classes with length constraint on the weight vectors in
feature space SV classes. Let $T$ be the operator $T = S_{\tilde{X}^m} R_w$ where $R_w \in \mathbb{R}^+$ and
the operator $S_{\tilde{X}^m}$ is defined by

$$S_{\tilde{X}^m} : \ell_2 \to \ell_\infty^m, \qquad w \mapsto (\langle \tilde{x}_1, w \rangle, \ldots, \langle \tilde{x}_m, w \rangle) \qquad (16)$$

with $\tilde{x}_j \in \Phi(X)$ for all $j$. The following theorem is useful when computing entropy
numbers in terms of $T$ and $A$. It is originally due to Maurey, and was extended
by Carl. See [25] for some extensions and historical remarks.

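Theorem 3 (Carl and Stephani [?, p. 246]). Let $S \in \mathfrak{L}(H, \ell_\infty^m)$ where $H$ is
a Hilbert space. Then there exists a constant $c > 0$ such that for all $m \in \mathbb{N}$, and
$1 \le n \le m$,

$$e_n(S) \le c \|S\| \Bigl( n^{-1} \log\Bigl( 1 + \frac{m}{n} \Bigr) \Bigr)^{1/2}.$$

The restatement of Theorem 3 in terms of $\epsilon_{2^{n-1}} = e_n$ will be useful in the
following. Under the assumptions above we have

$$\epsilon_n(S) \le c \|S\| \Bigl( (\log n + 1)^{-1} \log\Bigl( 1 + \frac{m}{\log n + 1} \Bigr) \Bigr)^{1/2}. \qquad (17)$$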

Now we can combine the bounds on entropy numbers of $A$ and $S_{\tilde{X}^m}$ to obtain
bounds for SV classes. First we need the following lemma.

Lemma 2 (Carl and Stephani [?, p. 11]). Let $E, F, G$ be Banach spaces,
$R \in \mathfrak{L}(F, G)$, and $S \in \mathfrak{L}(E, F)$. Then, for $n, t \in \mathbb{N}$,

$$\epsilon_{nt}(RS) \le \epsilon_n(R) \, \epsilon_t(S) \qquad (18)$$

$$\epsilon_n(RS) \le \epsilon_n(R) \, \|S\| \qquad (19)$$

$$\epsilon_n(RS) \le \epsilon_n(S) \, \|R\|. \qquad (20)$$

Note that the latter two inequalities follow directly from the fact that $\epsilon_1(R) = \|R\|$
for all $R \in \mathfrak{L}(F, G)$.



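Theorem 4 (Bounds for SV classes). Let $k$ be a Mercer kernel, let $\Phi$ be
induced via (5) and let $T := S_{\tilde{X}^m} R_w$ where $S_{\tilde{X}^m}$ is given by (16) and $R_w \in \mathbb{R}^+$.
Let $A$ be defined by (12) and suppose $\tilde{x}_j = \Phi(x_j)$ for $j = 1, \ldots, m$. Then the
entropy numbers of $T$ satisfy the following inequalities:

$$\epsilon_n(T) \le c \, \|A\| \, R_w \, \log^{-1/2} n \, \log^{1/2}\Bigl( 1 + \frac{m}{\log n} \Bigr) \qquad (21)$$

$$\epsilon_n(T) \le R_w \, \epsilon_n(A) \qquad (22)$$

$$\epsilon_{nt}(T) \le c \, R_w \, \log^{-1/2} n \, \log^{1/2}\Bigl( 1 + \frac{m}{\log n} \Bigr) \, \epsilon_t(A)$$

where $C_k$ and $c$ are defined as in Corollary 1 and Theorem 3.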
This result gives several options for bounding $\epsilon_n(T)$. The reason for using $\epsilon_n$
instead of $e_n$ is that the index only may be integer in the former case (whereas
it can be in $[1, \infty)$ in the latter), thus making it easier to obtain tighter bounds.
We shall see in examples later that the best inequality to use depends on the rate
of decay of the eigenvalues of $k$. The result gives effective bounds on $\mathcal{N}^m(\epsilon, \mathcal{F}_{R_w})$
since

$$\epsilon_n(T : \ell_2 \to \ell_\infty^m) \le \epsilon_0 \;\Longrightarrow\; \mathcal{N}^m(\epsilon_0, \mathcal{F}_{R_w}) \le n.$$

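Proof. We will use the following factorization of $T$ to upper bound $\epsilon_n(T)$.

$$\begin{array}{ccc}
U_{\ell_2} & \xrightarrow{\ T\ } & \ell_\infty^m \\
{\scriptstyle R_w} \downarrow & {\scriptstyle S_{\tilde{X}^m}} \nearrow & \uparrow {\scriptstyle S_{A^{-1} \tilde{X}^m}} \\
R_w U_{\ell_2} & \xrightarrow{\ A\ } & R_w \mathcal{E}
\end{array} \qquad (23)$$

The top left part of the diagram follows from the definition of $T$. The fact that the
remainder commutes stems from the fact that since $A$ is diagonal, it is self-adjoint
and so

$$\langle w, \tilde{x} \rangle = \langle w, A A^{-1} \tilde{x} \rangle = \langle A w, A^{-1} \tilde{x} \rangle. \qquad (24)$$

Instead of computing the covering number of $T = S_{\tilde{X}^m} R_w$ directly, which is
difficult or wasteful, as the bound on $S_{\tilde{X}^m}$ does not take into account that
$\tilde{x} \in \mathcal{E}$ but just makes the assumption of $\tilde{x} \in \rho U_{\ell_2}$ for some $\rho > 0$, we will
represent $T$ as $S_{A^{-1} \tilde{X}^m} A R_w$. This is more efficient as we constructed $A$ such
that $A^{-1} \Phi(X) \subseteq U_{\ell_2}$, filling a larger proportion of it than just $\frac{1}{\rho} \Phi(X)$.

By construction of $A$ and the Cauchy-Schwarz inequality we have $\|S_{A^{-1} \tilde{X}^m}\| \le 1$.
Thus applying Lemma 2 to the factorization of $T$ and using Theorem 3 proves
the theorem. ∎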

One can give (see below) asymptotic rates of decay for $\epsilon_n(A)$. (In fact we can
determine non-asymptotic results with explicitly evaluable constants.) It is thus
of some interest to give overall asymptotic rates of decay of $\epsilon_n(T)$ in terms of
the order of $\epsilon_n(A)$.

Lemma 3 (Rate bounds on $\epsilon_n$). Let $k$ be a Mercer kernel and suppose $A$ is
the scaling operator associated with it as defined by (12).

1. If $\epsilon_n(A) = O(\log^{-\alpha} n)$ for some $\alpha > 0$ then $\epsilon_n(T) = O(\log^{-(\alpha + 1/2)} n)$.

2. If $\log \epsilon_n(A) = O(\log^{\beta} n)$ for some $\beta > 0$ then $\log \epsilon_n(T) = O(\log^{\beta} n)$.

This Lemma (the proof of which is omitted; see [25]) shows that in the first
case, Maurey’s result (Theorem 3) allows an improvement in the exponent of the
entropy number of $T$, whereas in the second, it affords none (since the entropy
numbers decay so fast anyway). The Maurey result may still help in that case
though for nonasymptotic $n$. In a nutshell we can always obtain rates of
convergence better than those due to Maurey’s theorem because we are not dealing
with arbitrary mappings into infinite dimensional spaces. In fact, for logarithmic
dependency of $\epsilon_n(T)$ on $n$, the effect of the kernel is so strong that it completely
dominates the $1/\epsilon^2$ behaviour for arbitrary Hilbert spaces. An example of such
a kernel is $k(x, y) = \exp(-(x - y)^2)$.



4.3 Empirical Bounds 

Instead of theoretically determining the shape of $\Phi(X)$ a priori one could use the
training and/or test data to empirically estimate its shape and use this quantity
to compute an operator $A_{\mathrm{emp}}$ analogously to (10) which performs the mapping
described above. We merely flag this here; the full development of these ideas
requires considerable further work and will be deferred to a subsequent paper.
There are some remarks in the full version of this paper [?]. Furthermore the
statistical argument needed to exploit such techniques (bounding generalization
error in terms of empirical covering numbers) has now been developed; see [22].



5 Eigenvalue Decay Rates 

The results presented above show that if one knows the eigenvalue sequence
$(\lambda_i)_i$ of a compact operator, one can bound its entropy numbers. A commonly
used kernel is $k(x, y) = e^{-(x - y)^2}$, which has noncompact support. The induced
integral operator $(T_k f)(x) = \int_{-\infty}^{\infty} k(x, y) f(y) \, dy$ then has a continuous spectrum
and thus is not compact [?, p. 267]. The question arises: can we make use
of such kernels in SV machines and still obtain generalization error bounds of
the form developed above? This problem can be readily resolved by analysing
the $v$-periodic extension of the kernel in question, $k_v(x) := \sum_{j=-\infty}^{\infty} k(x - jv)$. A
simple argument gives

Lemma 4. Let $k : \mathbb{R} \to \mathbb{R}$ be a symmetric convolution kernel, let $K(\omega) = F[k(x)](\omega)$
denote the Fourier transform of $k(\cdot)$ and $k_v$ denote the $v$-periodic kernel
derived from $k$ (also assume that $k_v$ exists). Then $k_v$ has a representation as a
Fourier series with $\omega_0 := \frac{2\pi}{v}$ and

$$k_v(x - y) = \sum_{j=-\infty}^{\infty} \frac{\sqrt{2\pi}}{v} K(j \omega_0) \, e^{i j \omega_0 (x - y)}.$$

Moreover $\lambda_j = \sqrt{2\pi} \, K(j \omega_0)$ for $j \in \mathbb{Z}$ and $C_k = \sqrt{2/v}$.

This lemma tells one how to compute the discrete eigenvalue sequence for kernels
with infinite support; for more details see [?].
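
The eigenvalue formula of Lemma 4 can be checked numerically for a concrete kernel. The Python sketch below is an illustration under stated assumptions: it takes the Gaussian $k(x) = e^{-x^2}$, assumes the unitary Fourier convention $K(\omega) = (2\pi)^{-1/2} \int k(x) e^{-i \omega x} dx$ (so that $K(\omega) = e^{-\omega^2/4} / \sqrt{2}$ and hence $\lambda_j = \sqrt{\pi} \, e^{-(j \omega_0)^2 / 4}$), truncates the periodization to finitely many shifts, and compares the closed form with a direct quadrature of $\int_0^v k_v(x) \cos(j \omega_0 x) \, dx$. All function names are hypothetical.

```python
import math


def gaussian(x):
    return math.exp(-x * x)


def periodized(k, x, v, terms=20):
    """Truncated v-periodic extension k_v(x) = sum_n k(x - n v)."""
    return sum(k(x - n * v) for n in range(-terms, terms + 1))


def eigenvalue_closed_form(j, v):
    """lambda_j = sqrt(2 pi) K(j omega_0) for k(x) = exp(-x^2), unitary convention."""
    omega0 = 2.0 * math.pi / v
    return math.sqrt(math.pi) * math.exp(-((j * omega0) ** 2) / 4.0)


def eigenvalue_numeric(j, v, grid=400):
    """Integrate k_v(x) cos(j omega_0 x) over one period with an equispaced rule."""
    omega0 = 2.0 * math.pi / v
    h = v / grid
    return h * sum(
        periodized(gaussian, i * h, v) * math.cos(j * omega0 * i * h)
        for i in range(grid)
    )


# Compare the first few eigenvalues for period v = 4.
pairs = [(eigenvalue_closed_form(j, 4.0), eigenvalue_numeric(j, 4.0)) for j in range(4)]
```

Increasing $v$ makes $\omega_0$ smaller, so the eigenvalues decay more slowly in $j$; this is one way to see why the choice of $v$ matters for the resulting bounds.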

The above results show the overall covering numbers of an SV machine are
controlled by the entropy numbers of the admissible scaling operator $A$:
$\epsilon_n(A : \ell_2 \to \ell_2)$. One can work this out (with constants), although it is somewhat
intricate to do so. Here we simply state how $\epsilon_n(A)$ depends asymptotically on the
eigenvalues of $T_k$ for a certain class of kernels.
Proposition 2 (Exponential Polynomial decay). Suppose $k$ is a Mercer
kernel with $\lambda_j = \beta^2 e^{-\alpha j^p}$ for some $\alpha, \beta, p > 0$. Then

$$\ln \epsilon_n^{-1}(A : \ell_2 \to \ell_2) = O\bigl( \ln^{\frac{p}{p+1}} n \bigr).$$

An example of such a kernel (for $p = 2$) is $k(x) = e^{-x^2}$. It can also be shown
that the rate in the above proposition is asymptotically tight. For a proof, and
related results, see [25].

6 The Missing Pieces and Some Conclusions 

In this short version we have omitted many details and extensions such as 

Discretization How should one choose v in periodizing a non-compact kernel? 
Higher Dimensions The results need to be extended to multi-dimensional
kernels to be practically useful. Several additional technical complications
arise in doing so. 

Glueing it all Together We have given the ingredients but not baked the 
cake. Since the approach we have taken is new, and since there are a wide 
range of different uniform convergence results one may use we have refrained 
from putting it all together into a “master generalization error theorem.” It
should be clear that it is possible to do so. 

Combining all these pieces together does give an (albeit complicated) answer 
to the question “what is the effect of the kernel?” Different kernels, or even 
different widths of the same kernel, give rise to different covering numbers and 
hence different generalization performance. We hope eventually to be able to 
give simple rules of thumb concerning the overall effect. The mere fact that 
entropy number techniques provide a handle on the question is interesting in 
itself though. 

In summary, we have shown how to connect properties known about mappings 
into feature spaces with bounds on the covering numbers. Our reasoning relied 
on the fact that this mapping exhibits certain decay properties to ensure rapid 
convergence and a constraint on the size of the weight vector in feature space. 
This means that the corresponding algorithms have to restrict exactly this
quantity to ensure good generalization performance. This is exactly what is done in
Support Vector machines. The method used to obtain the results (reasoning via
entropy numbers of operators) would seem to be a nice new viewpoint and
valuable for other problems.
Fig. 1. Schematic picture of the new viewpoint.